1. INTRODUCTION

Computers that understand speech are expected to facilitate natural
man-machine interaction, but the problems involved demand the attention of
several disciplines, including linguistics, computer systems design, perception
theory, speech research, and engineering. Linguistic and perceptual arguments,
in particular, suggest that devices which recognize speech will have to make use
of grammatical structure ("syntax") in early stages of the recognition
procedures (Lea, 1972a,b; 1973b; Lea, Medress, and Skinner, 1972a). This can be
accomplished, in part, by using certain acoustic correlates of prosody, such as
energy and voice fundamental frequency contours, to segment the speech into
grammatical phrases, and to identify those syllables that are given prominence,
or stress, in the sentence structure.

In this paper, methods are described for (1) detecting syntactic boundaries
from fall-rise patterns in voice fundamental frequency (F0) contours, then (2)
locating stressed syllables, within each syntactic unit, as high-energy portions
of the speech which exhibit significantly high and rising (or, in some cases,
non-falling) F0 contours. The algorithmic locations of stressed syllables are
compared with listeners' perceptions of stress, to determine how well the
algorithmic results correspond with perceived prominence.

Once the connected speech is segmented into phrases, and stressed syllables
are located, the Univac speech recognition strategy would call for a partial
distinctive feature analysis within each stressed syllable. Consonants and
vowels are expected to be more clearly articulated and easier to distinguish
in stressed syllables than in unstressed or reduced syllables (cf. Lea, Medress,
and Skinner, 1972b), where articulation (and consequent acoustic information) is
not as precise or consistent from talker to talker or time to time.

Next, the partial distinctive features description would be matched with
generated or stored patterns for possible stressed syllables or words in the
lexicon. Then a guess as to the word content of the constituent would be made,
based on the reliable feature information from the stressed syllables, plus
other reliable data within the constituent (such as presence of coronal strident
fricatives, etc.; cf. Medress, 1972). Each guess as to constituent identity
would be combined with those for other constituents in the sentence until a
satisfactory set of hypotheses for all constituents yielded the grammatical,
meaningful sentence.

In addition to aiding partial distinctive features estimation, the presence
of syntactic boundaries and the positions of stressed syllables are
expected to help guide syntactic parsers (Lea, 1972a). For example, an
investigation has begun of the feasibility of using prosodically-detected
syntactic boundaries to affect the priority order on transition arcs and the
pop-up procedures in parsers based on Woods' transition network grammar
(Woods, 1971).



In the remainder of this report, the encouraging successes in applying
the boundary detector and a stressed syllable location algorithm will be
presented. In section 2, the speech texts selected for this research are
given, and their relative merits for prosodic analyses are outlined. Then,
in section 3, an algorithm is described for detecting constituent boundaries
from fall-rise patterns in F0 contours, and its application to the selected
texts is shown to provide successful detection of over 80% of all predicted
syntactic boundaries.

In section 4, experiments are described which show that several listeners
rather consistently classified all syllables in the spoken texts into one of
three categories: stressed, unstressed, or reduced. Issues of interest with
regard to these stress perceptions are the effects of individual talkers,
individual listeners, and various texts, how consistent the listener's
perceptions are from time to time, whether the listener can predict stress levels
given only the written text (without listening to the speech recordings), and
which stress levels (stressed, unstressed, or reduced) are most consistently
assigned. The majority decisions of the team of listeners provide the
standard by which a stressed syllable algorithm can be judged.

In section 5, an algorithm is described for locating stressed syllables
within the phrases delimited by the constituent boundary detector. This
algorithm, which is based on previous intonation theories and studies of
acoustic correlates of stress, assumes that stressed syllables will be
accompanied by rising or non-falling F0 and a large energy integral. The
results show that about 85% of all syllables that were usually judged as
stressed by a majority of the listeners were also found by the algorithm.

In section 6, further work is outlined, to improve the algorithms for
syntactic segmentation and stressed syllable location, and to combine
partial distinctive features analysis within the stressed syllables with
aids to syntactic parsing. Such efforts would yield critical portions of
the proposed speech recognition strategy.



Appendices are included to detail the results in constituent boundary
location (Appendix A), perceived stress patterns (Appendix B), and the
results of algorithmic location of stressed syllables (Appendix C).

2. SELECTED SPEECH TEXTS

To test the algorithms for boundary detection and stressed syllable
location, speech texts had to be chosen, recorded, submitted to listeners
for stress perceptions, and analyzed by the computer programs. The primary
text chosen for these studies was the first paragraph of the "Rainbow
Passage" (Fairbanks, 1940). It reads as follows:

"When the sunlight strikes raindrops in the air, they act like a
prism and form a rainbow. The rainbow is a division of white light
into many beautiful colors. These take the shape of a long round
arch, with its path high above, and its two ends apparently beyond
the horizon. There is, according to legend, a boiling pot of gold
at one end. People look, but no one ever finds it. When a man
looks for something beyond his reach, his friends say he is looking
for the pot of gold at the end of the rainbow."



This text (hereinafter called the Rainbow Script) has been used
extensively in studies of prosodic patterns in speech, and has the advantage
of being a well-known semantically-connected text of declarative sentences,
with a variety of grammatical phrase structures (cf. Lea, Medress, and
Skinner, 1972a). It was recorded by six talkers (four male, two female) in
a quiet room at Purdue University.

In texts like the Rainbow Script, the factors determining positions of
stress within words (lexical stress) are compounded with sentence structure
effects on stress (cf. Chomsky and Halle, 1968; Halle and Keyser, 1971).
Another text, composed of only monosyllabic words, was also analyzed,
to eliminate or minimize lexical stress effects. This text, read by two of
the six talkers who had read the Rainbow Script, is the first paragraph of a
short story:

"John and I went up to the farm in June. The sun shone all day, and
wind waved the grass in wide fields that ran by the road. Most birds
had left on their trek south, but old friends were there to greet us.
Piles of wood had been stacked by the door, left there by the man who
lives twelve miles down the road. The stove would not last till dawn
on what he had cut, so I went and chopped more till the sun set."

Lea (1972a,b) had previously processed recordings of this text (hereinafter
referred to as the Monosyllabic Script) for constituent boundary detection
at Purdue University. Comparing his previous results with the boundary
detections found by the Univac implementation of his algorithm helped verify
the new algorithm.

Both the Rainbow Script and the Monosyllabic Script involve read speech,
all of declarative structure. To evaluate the boundary detection and stressed
syllable location techniques with questions, commands, and declaratives of
direct utility in man-machine interactions, thirteen sentences were selected
from actual recordings by five contractors who are developing speech
understanding systems for the Advanced Research Projects Agency (ARPA) of the
Department of Defense. These sentences were not read, but were composed on
the spot in simulated protocols of man-machine interaction. The semantic
context of each sentence was pertinent to a particular task domain adopted by
the builder of a speech understanding system, such as retrieving information
on lunar rock samples (Woods, 1971), other information-retrieval tasks,
instructing a robot to move objects in a block world (Walker, 1973), or voice
programming.

These thirteen sentences are as follows:

 1. (LS21)   Who's the owner of utterance eight?
 2. (LML3)   Display the phonemic labels above the spectrogram.
 3. (B27)    Do any samples contain troilite?
 4. (B10)    What is the average uranium lead ratio for the lunar
             samples?
 5. (BB6)    Do you have any right square boxes left?
 6. (EB16)   Put the other red block on the red block.
 7. (IW3)    Who is the owner of utterance eight?
 8. (B35)    Do any samples contain tridymite?
 9. (RA19)   Would you move the stack of right circular cylinders to
             the right by half a square?
10. (RC8)    Place the red triangle two squares back from the front
             of the floor in the middle.
11. (CV1300) Alpha becomes alpha minus beta.
12. (CV2300) Alpha gets alpha minus beta.
13. (D10)    Repeat where key word equals Gauss elimination or
             keyword equals eigenvalues.

The recordings of these 13 "ARPA Sentences" involved ten different talkers,
each one saying one or more of the sentences, indicated by the distinguishing
alphabetic code for each talker, shown within the parentheses.

We shall distinguish the first six ARPA sentences (hereinafter referred to
as "6ARPA Sentences") from the last seven (referred to as "7ARPA Sentences"),
since the first six are being studied extensively by various ARPA contractors,
while the seven additional sentences were selected by the author to provide
several additional interesting syntactic constructions and more syntactic
boundaries than the first six had provided. These sentences show a variety of
sentence types (three questions with interrogative words (who, what), three
yes-no questions, four imperatives, one "polite" command or request, and two
declarations), with emphasis on questions and commands, which are expected to
be of major interest to man-computer communications. Some of the structures
(as in D10) are not usual English forms, but obey syntax equations being
designed into the restricted parsers of speech-understanding systems. Yet
each sentence has at least one interesting phrase structure or contrast with
another possible structure, such as the compound nouns in D10, the sequence of
noun phrases and prepositional phrases in RA19 and RC8, or the "minimal pair"
contrasts between B27 and B35 or LS21 and DO.



1. The first letter of the code identifying each sentence, as shown within the
parentheses of this list, indicates the ARPA contractor which recorded the
sentence (B = Bolt, Beranek, and Newman; C = Carnegie Mellon University;
D = Systems Development Corporation; L = Lincoln Laboratories; and R =
Stanford Research Institute). The second letter, when it appears, identifies
which talker from that organization spoke the associated sentence, or, in the
case of CV codes, it marks the task as voice programming. Numbers in the
code indicate the order in which the sentence appeared in that organization's
protocol of utterances. This complex code is included here since these same
utterances are being studied, under such identifiers, by various ARPA
contractors.

The recordings of all these speech texts provide a total of 379 predicted
syntactic boundaries and 1128 syllables for evaluating the effectiveness of the
boundary detection and stressed syllable location algorithms.

3. CONSTITUENT BOUNDARY DETECTION

3.1 Obtaining F0 and Energy Measurements

The speech recordings for the Rainbow Script, Monosyllabic Script, and
ARPA Sentences were digitized and submitted to computer programs that obtained
fundamental frequency and broadband (5 kHz) energy measures for each 10
milliseconds (ms) of speech. The fundamental frequency measure in Hertz, as
provided by autocorrelating the center-clipped waveform (Sondhi, 1958), was also
converted to eighth-tones, yielding a log frequency scale for relative
measurements. The energy measure was obtained, using a 25.6 ms Hanning window,
from the sum of the squares of the time waveform values (Blackman and Tukey,
1958), followed by a conversion to a relative (dB) scale. Smoothed spectra from
a linear prediction scheme (Makhoul, 1972) and formant tracks were also
obtained, but were not used for the present studies except to help determine
where in the text each F0 or energy effect occurred.

The F0 and energy measurements were plotted versus time by a computer
plotting program. For examples of F0 and energy plots, see (Lea, Medress, and
Skinner, 1972a, p. 25) or Figure 12 in section 5 of this report.
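
The two measurement scales described in this section can be sketched as
follows. This is a minimal illustration, not the report's program. The function
names are hypothetical, the eighth-tone scale is assumed to have 48 steps per
octave (six whole tones of eight eighth-tones each), the reference frequency is
arbitrary, and a 25.6 ms window is taken to be 256 samples on the assumption of
a 10 kHz sampling rate, which the text does not state.

```python
import math

def hz_to_eighth_tones(f0_hz, ref_hz=1.0):
    """Convert F0 in Hz to a log scale, assuming 48 eighth-tones per octave."""
    return 48.0 * math.log2(f0_hz / ref_hz)

def hanning(n):
    """Hanning (raised cosine) window of n points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_energy_db(samples, start, length=256, floor=1e-12):
    """Relative energy (dB) of one windowed frame: sum of squared samples."""
    window = hanning(length)
    frame = samples[start:start + length]
    energy = sum((s * w) ** 2 for s, w in zip(frame, window))
    # The floor avoids log(0) on silent frames; its value is an assumption.
    return 10.0 * math.log10(max(energy, floor))

# Usage: one octave spans 48 eighth-tones, and doubling the amplitude
# raises the relative energy by about 6 dB.
tone = [math.sin(2.0 * math.pi * 100.0 * i / 10000.0) for i in range(256)]
louder = [2.0 * s for s in tone]
octave_steps = hz_to_eighth_tones(200.0, 100.0)
gain_db = frame_energy_db(louder, 0) - frame_energy_db(tone, 0)
```

Because both scales are relative, only differences between frames (not the
absolute values) carry meaning in such a sketch.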

3.2 The Constituent Boundary Detector 

The F0 measurements were then submitted to an algorithm for detecting
boundaries between major grammatical constituents. This boundary-detection
algorithm (Lea, 1972a,b; 1973b) is based on an assumption that F0 will usually
decrease (about 7% or more) at the end of each major syntactic constituent, and
then increase (about 7% or more) either at the beginning of the following
constituent or after any unstressed syllables at the beginning of that following
constituent. Experimenting with fundamental frequency contours in over 500
seconds of speech (including short stories, newscasts, weather reports, and
excerpts from conversations, spoken by nine talkers), Lea had shown that over
80% of all syntactically predicted boundaries were correctly detected (Lea,
1972a,b; 1973b). Lea had, however, observed that about half of all "missing"
boundaries were due to predicted boundaries between noun phrases and following
auxiliary verbs or main verbs. He concluded that such noun phrase-verbal
boundaries should not always be expected in phonological structure.

Detecting such syntactic structure from F0 contours is complicated
by the fact that, at consonant-vowel (and vowel-consonant) boundaries,
variations in F0 occur which may be confused with the changes marking
syntactic boundaries. False (syntactically unrelated) boundary detections
resulted from F0 variations at these boundaries between vowels and
consonants, but most such false alarms could be eliminated by setting a
minimum percent variation (about 10%) in F0 for a boundary detection. A
detailed study of F0 variations at phonetic boundaries (Lea, 1972a, Ch. 4;
cf. also Lea, 1973a) clearly indicated that such phonetically-dictated
changes in F0 would rarely exceed about 10%.
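
The fall-rise criterion described above can be illustrated with a short
sketch. It is a simplified reading of the text, not Lea's program: the function
name is hypothetical, percent changes are measured relative to the running peak
and valley values (an assumption), unvoiced frames are represented as None, and
a detected boundary is reported at the bottom of the F0 valley.

```python
def detect_fall_rise(f0, threshold=0.07):
    """Return frame indices of F0 valleys framed by a fall and a rise.

    f0        -- one F0 value per 10 ms frame, or None where unvoiced
    threshold -- minimum fractional fall and rise (0.07 means 7%)
    """
    boundaries = []
    peak = None    # highest F0 seen since the last boundary
    valley = None  # (index, value) of the lowest F0 since the fall began
    for i, f in enumerate(f0):
        if f is None:
            continue  # skip unvoiced frames
        if valley is None:
            if peak is None or f > peak:
                peak = f
            elif f <= peak * (1.0 - threshold):
                valley = (i, f)  # a significant fall has begun
        else:
            if f < valley[1]:
                valley = (i, f)  # still falling, move the valley bottom
            elif f >= valley[1] * (1.0 + threshold):
                boundaries.append(valley[0])  # significant rise: boundary
                peak, valley = f, None
    return boundaries

# Usage: a fall from 122 Hz to 108 Hz followed by a rise to 118 Hz
# yields one boundary at the valley bottom (frame 4).
contour = [120, 122, 118, 110, 108, 112, 118, 125, 121, 119]
found = detect_fall_rise(contour)
```

Small wiggles, such as those at consonant-vowel transitions, stay below the
threshold and produce no detection in this sketch.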

The boundary detection algorithm also detects clause and sentence
boundaries wherever long (350 ms) stretches of unvoicing (i.e., "pauses")
occur.
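
A pause detector of this kind can be sketched as a search for long unvoiced
runs. The function name and the boolean per-frame voicing input are
assumptions; the 10 ms frame step and 350 ms minimum come from the text. Note
that this sketch would also report silence at the very start or end of a
recording, which a real implementation might treat separately.

```python
def find_pauses(voiced, frame_ms=10, min_pause_ms=350):
    """Return (first_frame, last_frame) for each long unvoiced stretch."""
    pauses = []
    run_start = None
    # A trailing True acts as a sentinel that closes a run ending the input.
    for i, v in enumerate(list(voiced) + [True]):
        if not v and run_start is None:
            run_start = i
        elif v and run_start is not None:
            if (i - run_start) * frame_ms >= min_pause_ms:
                pauses.append((run_start, i - 1))
            run_start = None
    return pauses

# Usage: a 400 ms unvoiced stretch is reported, a 100 ms one is not.
voicing = [True] * 20 + [False] * 40 + [True] * 10 + [False] * 10 + [True] * 5
pauses = find_pauses(voicing)
```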



While several improvements could be made in the original algorithm,
and in the previous predictions as to where syntactic boundaries should
be detected, the present studies were done with substantially the same
algorithm, implemented as a FORTRAN program at the Univac Speech
Communications Laboratory. One exception is that the results to be reported for
the Rainbow Text were obtained by a hand analysis, strictly following Lea's
algorithm, but including one refinement which eliminated some false
boundaries resulting from large (7% or greater) variations in F0 that only
last for one 10 ms time sample. The original hypothesis that boundaries
would occur between noun phrases and verbals was also maintained, until a
precise formulation of when it fails could be established. As Lea had
previously suggested (Lea, 1972b), boundaries were not predicted between
pronouns and following verbals.
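
One way to approximate the refinement for one-sample F0 excursions is to
smooth them away before boundary detection. This is a substitute technique
chosen for illustration, not the hand procedure used in the report: the
function name is hypothetical, all frames are assumed voiced, and an isolated
frame is replaced by the mean of its neighbors when it jumps by the threshold
or more in one direction and immediately jumps back.

```python
def suppress_single_frame_spikes(f0, threshold=0.07):
    """Replace isolated one-frame F0 excursions with the neighbor average."""
    cleaned = list(f0)
    for i in range(1, len(f0) - 1):
        prev, cur, nxt = f0[i - 1], f0[i], f0[i + 1]
        jump_in = abs(cur - prev) / prev >= threshold
        jump_out = abs(nxt - cur) / cur >= threshold
        reverses = (cur - prev) * (nxt - cur) < 0  # up then down, or down then up
        if jump_in and jump_out and reverses:
            cleaned[i] = (prev + nxt) / 2.0
    return cleaned

# Usage: a one-frame dip is removed, a two-frame valley is kept.
spike_removed = suppress_single_frame_spikes([100, 100, 90, 100, 100])
valley_kept = suppress_single_frame_spikes([100, 90, 90, 100])
```

An extremum that persists for two frames (20 ms) survives, which matches the
intent stated in the text that only variations lasting a single 10 ms sample
are discarded.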
3.3 Boundaries Detected in the Rainbow Text

Figure 1 shows some typical boundary marking for a portion of the Rainbow
Script, spoken by male talkers ASH, GWH, WB, and JP, and female talkers PB and
ER. Detected boundaries that corresponded to predicted boundaries are shown
by vertical bars below the place in the speech where they occurred. Unpredicted
F0 valleys which could be correlated with lower level syntactic boundaries are
shown by columns of dots. False boundaries, due to nonsyntactic effects such as
F0 changes at consonant-vowel boundaries, are shown by question marks. When a
predicted boundary was missing from the detection, an asterisk is marked at the
predicted position for the syntactic boundary. Sentence boundaries, determined
by "pauses" of long-term unvoicing, are marked by dollar signs ("S's" with the
vertical bars of "predicted" boundaries).

Thus, predicted boundaries are shown to be detected for all talkers between
the copulative "is" and the object noun phrase "a division", and before the
prepositional phrases "of white light" and "into many beautiful colors". The
predicted boundary between the noun-phrase subject "The rainbow" and the verbal
"is" was detected for five of the talkers, but missed in the F0 contours of
talker PB. These noun phrase/verbal boundaries are more frequently missed in
other instances in the texts, as may be seen from Figures A-1, A-2, and A-3 of
Appendix A, which illustrate the complete set of boundary detection results for
all the texts and talkers.

Sometimes the rise in F0 into a constituent may be delayed due to initial
weakly stressed syllables or function words like "a", "of", "into", etc. The
bottom of the F0 valley may then be delayed until within such weak beginnings of
constituents, as illustrated by the horizontal arrows in the beginnings of
constituents such as "a division", "of white light", and "into many beautiful
colors". This delay is considered a predictable result of the stress patterns,
and such displaced boundaries are still considered correctly detected. These
delays, however, illustrate that the algorithm is not locating syntactic
boundaries, only detecting them.

Table I summarizes the boundary detection results obtained from the hand
analysis of the Rainbow Text, as spoken by the six talkers. Forty-two
constituent boundaries had been predicted by the independent syntactic analysis
based on an intuitive constituent-structure division of the sentences, and
previous experience with fundamental frequency patterns. Table I shows that
the number of correctly detected boundaries (second column from the left)
varied somewhat from talker to talker, yielding detection scores (third
column from the left) that ranged from 67% to 86% of all predicted boundaries.
The average detection score (79%) is very close to the 81% scores obtained by
Lea in previous experiments with other texts (Lea, 1973b).
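
As a worked check of how such detection scores arise, the sketch below
computes the percentage of predicted boundaries detected. The function name is
hypothetical, and the counts of 28 and 36 are inferred from the reported 67%
and 86% with 42 predicted boundaries; they are not read from Table I.

```python
def detection_score(detected, predicted=42):
    """Percent of predicted boundaries that were detected, rounded."""
    return round(100.0 * detected / predicted)

# Usage: with 42 predicted boundaries per talker, 28 and 36 correct
# detections correspond to the lowest and highest reported scores.
lowest = detection_score(28)
highest = detection_score(36)
```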

Also tabulated in Table I are the numbers of "extra" detected boundaries
(fourth column from the left) that related to boundaries between minor syntactic
constituents, but which had not been predicted by the particular syntactic
analysis used. An improved procedure for predicting prosodically-marked
syntactic boundaries might include these among the "predicted" boundaries for
future studies. The last column in Table I shows the number of false
(syntactically unrelated) boundaries that were found in each spoken text.
These "false alarms" are considerably reduced in number from Lea's previous
results (1972b, p. 66), because of the refinement that requires maxima and
minima to last for at least two time segments (20 ms).

All boundaries between matrix sentences (five per talker) were accompanied
by long (350 ms or more) durations of unvoicing, and were thus correctly marked
as sentence boundaries. However, boundaries between embedded sentences (that
is, clause boundaries within matrix sentences), while always marked as
constituent boundaries by F0 valleys, were accompanied by pauses in only 14 of
the expected 24 instances.

An apparent "sentential pause" that had not been predicted (but which is not
surprising) occurred after the parenthetical phrase "according to legend", for
two of the six talkers. No pauses of 350 ms or longer occurred at other than
such major syntactic boundaries.

These results for the Rainbow Script are very similar to those found
earlier by Lea for another set of talkers reading weather reports, newscasts,
and other texts, and for short conversational excerpts.

3.4 Boundaries Detected in the Monosyllabic Script

Figure A-2 in Appendix A shows the complete set of boundary detections
for the Monosyllabic Script, as found by the Univac implementation of Lea's
original algorithm. These computer-derived results are similar to those
shown in Figure 1 for the Rainbow Script, and agree substantially with results
reported by Lea (1972b, p. 199) for two other talkers.

Table II summarizes the boundary detection results for the Monosyllabic
Script. The scores of 86% (for ASH) and 80% (for GWH) show an improvement
over the [illegible]% and 66% correct detection of predicted boundaries
reported for the same two talkers (ASH and GWH) in Lea's earlier hand analysis
(Lea, 1972b, p. 56). The reason for this improvement is a revision in the
syntactic predictions (based on the previous results with other talkers)
whereby boundaries are not expected (a) between pronouns and following
verbals (though they are presently still predicted between non-pronoun noun
phrases and following verbals) or (b) between "man" and the following relative
pronoun "who". Also, boundaries had (erroneously) not been predicted between
"Piles" and "of wood", and between the adverbial conjunction "till" and "the
sun", in the earlier work.

It is expected that other refinements of the boundary predictions could be
made, and should be based on a more precise theoretically-cohesive set of rules
for predicting intonation contours from syntactic structure (cf. Bierwisch,
1966). A study has begun to devise rules, incorporating some recent work of
Jane Robinson (University of Michigan).

Half of the missing boundaries (predicted but not detected) were between
noun phrases and following verbals, so that the planned refinement to not predict
boundaries in such positions will bring the boundary detection scores to above
90%.

As shown in Table II, each talker also yielded six extra boundaries,
about half of which were due to "Tune 2" fundamental frequency rises at the
ends of sentences (Armstrong and Ward, 1926; Lea, 1972b, p. 25). A refinement
in the boundary detection procedure in sentence-final positions can
readily eliminate these "Tune 2" effects (Lea, 1972b, pp. 68-69).

Seven false alarms occurred in the text by ASH, and three in that by
GWH. All but one of these can be eliminated by the refinement (see discussion
of the Rainbow Script) that requires that each new maximum or minimum must
be maintained (above the 7% threshold for fall or rise) for at least 20 ms.

3.5 Boundaries Detected in the ARPA Sentences

Figure A-3 in Appendix A shows the complete boundary detection results
for the thirteen ARPA Sentences, as obtained by the Univac implementation of
the boundary detector. For various reasons, the overall boundary detection
score (74%) is somewhat lower than for the read texts used in previous studies
(79% to 90%). For one thing, some of the utterances (e.g., UG) were quite
monotone in expression, yielding insufficient F0 variations to trigger the 7%
thresholds of the boundary detector. A few sentences were said with several
hesitation pauses, and somewhat unusual inflections compared to the speech
previously studied. As shown in Table A-I of Appendix A, the type of sentence
has some effect on results, although no strong claims can be made about effects
of sentence types from this small amount of data. Six of the thirteen missing
boundaries were within compound nouns such as "key word", "Gauss elimination",
and "utterance eight". Despite these variations from previous results, it is
encouraging that 74% of the predicted boundaries were found in these various
forms of spontaneous utterances pertinent to man-machine interactions.

Extra boundaries that occurred were sometimes associated with talkers'
hesitations as they thought about what to say next, or with unusual stress
patterns apparently associated with the spontaneity of the utterances. Eight
"false" pauses occurred that were not clause or sentence boundaries, but rather
were thoughtful hesitations not to be found in read speech. Some of these
occurred at constituent boundaries, but not all (cf. Goldman-Eisler, 1961).

Seven false constituent boundaries were also detected, all but one of which
cannot be eliminated unless the minima and maxima of F0 are required to
remain beyond the 7% thresholds for at least thirty milliseconds.

In Appendix A, a study is described which determined the effects of
varying the threshold for "significant" F0 falls or rises. A threshold of
3% decrease or increase in F0 will allow detection of all but one predicted
boundary, but will substantially increase the number of false boundaries
detected, when compared to the 7% threshold value used in the present studies.
These effects of threshold were very similar to those previously found (Lea,
1972b, Figure 2-5) for other texts, except that somewhat smaller F0
variations appear to be adequate for boundary marking in the spontaneous speech
of the ARPA sentences. Intonational variations thus appear to be more
"animated" (i.e., larger) in the reading of texts than in simulated
man-machine interactions.
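
A threshold study of this kind needs a way to score each detector run against
the predicted boundaries. The sketch below shows one possible bookkeeping
scheme, under assumptions not stated in the report: boundaries are given as
frame indices, a detection counts as a hit if it lies within a fixed tolerance
of a predicted boundary, and all unmatched detections are lumped together
(the report separates "extra" from "false" boundaries by syntactic judgment,
which this sketch does not attempt).

```python
def score_detections(detected, predicted, tolerance=10):
    """Match each predicted boundary to at most one nearby detection."""
    unused = sorted(detected)
    hits = 0
    for p in sorted(predicted):
        match = next((d for d in unused if abs(d - p) <= tolerance), None)
        if match is not None:
            unused.remove(match)  # each detection may be used only once
            hits += 1
    return {"hits": hits,
            "missed": len(predicted) - hits,
            "unmatched": len(unused)}

# Usage with hypothetical frame indices: three of four predicted
# boundaries are found, and one detection matches nothing.
result = score_detections([12, 48, 95, 140], [10, 50, 100, 200])
```

Running such a scorer at each candidate threshold would give the trade-off
described above: lower thresholds raise the hit count and the unmatched count
together.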

The Univac implementation of the constituent boundary detector allows
different thresholds for F0 decreases and increases, a refinement not
incorporated into Lea's earlier algorithm. The threshold studies reported in
Appendix A show that better boundary detection results (that is, more predicted
boundaries are actually detected while fewer false boundaries are detected)
when the threshold for F0 fall is greater than that for F0 rise. This is
consistent with previous studies that have shown a general trend toward
falling F0 contours, with local interruptions of that falling contour marking
the beginnings of new constituents.

3.6 Summary of Boundary Detection Results

Table III summarizes all the boundary detection results for the three
texts, showing percents of all predicted boundaries that were detected, the
numbers of extra boundaries related to minor constituent breaks, and the
numbers of false boundaries.

These results encourage one to use F0-detected boundaries in detecting
significant aspects of sentence structure directly from acoustic data. This
is true even for spontaneous utterances taken from man-machine interactions,
such as the ARPA sentences. In section 5, we shall see that the successes
(and occasional failures) of the boundary detector play critical roles in
the process of stressed syllable location.

[Table III. Summary of boundary detection results for the three texts. The
table body is not recoverable from the source.]

Report No. PX 10146, UNIVAC

4. PERCEIVED STRESS PATTERNS

4.1 Experimental Methods

Experiments have also been conducted to study the stress patterns in the
Rainbow Script, Monosyllabic Script, and ARPA Sentences. Actually, a
three-fold experimental effort is involved in the total study of stress patterns
(cf. Lea, Medress, and Skinner, 1972a). One aspect is the presentation of
the scripts to individual listeners, who are asked to mark their personal
judgments as to which syllables are stressed, unstressed, or reduced. A
second aspect of the studies of stress is the analysis of acoustic correlates
of stress, and the testing of an algorithm for stressed syllable location
from acoustic data. A third aspect of the stress studies is the prediction
of stress levels and vowel reductions from linguistic analyses, including
syntactic analyses of the sentences in the speech texts, followed by
application of appropriate stress rules and vowel reduction rules. These
linguistic predictions of stress may be done with any of several available
sets of rules for English stress assignment.

Only the first two aspects of these stress studies will be discussed in
this report. The linguistic predictions are the subject for a future report.
The algorithmic location of stressed syllables from acoustic data will be
discussed in section 5. Here we consider the experiments on perceived stress.

Listeners' perceptions of stress provide a standard by which stress
detections from acoustic cues can be tested. Previous studies have attempted
to determine how listeners' judgments of stress vary as certain acoustic
features are varied, usually in synthesized speech (cf. Lea, Medress, and
Skinner, 1972a, pp. 32-40). However, few such studies have been concerned
with the stress patterns throughout sentences; most work was done on isolated
words such as minimal pairs of noun versus verb (pérmit/permít, etc.). Some
attempts have been made to determine listeners' perceptions of the most
stressed syllable in a sentence, or which of two specific syllables is more
stressed, or whether a specific single syllable is or is not stressed. The
present experiments extend studies to all syllables in the sentences.

Three listeners (WAL, MFM, and TES) each individually heard (through
earphones) the Rainbow Script as recorded by the six talkers, the Monosyllabic
Script as recorded by the two talkers, and the ARPA Sentences. Each listener
heard clauses or sentences, or other extended portions of the text, repeated
at will by the listener's rewinding and replaying of the tape. The Rainbow
Script was specifically separated into clauses separated by long pauses, to
aid the rewinding and replay, while the other recordings were not. The
listeners endeavored to rewind far enough to always hear an entire clause, to
have a constant context within which to judge relative stress levels. Each
listener could listen to the tape portions as often as necessary to mark each
syllable. He was free to back up the tape at his choice, and no time limit
or procedural constraints were placed on him.

The listener was instructed to mark (in whatever way he chose), for each
syllable, whether he heard that syllable as stressed, unstressed, or reduced.
To facilitate marking for each syllable, each script was typed on a sheet of
paper with vertical slashes between syllables (except for the Monosyllabic
Script, in which each word is one syllable). A mark was required for each
syllable (between two slash marks). The listener received one such sheet
for each talker and text. An example perception sheet is shown in Figure B-1
of Appendix B.

Each listener repeated the perception test three times (with no less than
three days between trials) to establish listener consistency from one time to
another. Also, to establish that the actual speech heard was playing a role
in stress judgments, the listeners were asked to report their stress
judgments given only the written text. This test with no speech was included
to determine whether the listener's presuppositions, internal "theory" of
expected stress patterns, or own way of speaking the sentences was the sole
source of his decisions, or whether the acoustic data actually was supplying
cues to stress patterns. These no-speech stress judgments were also obtained
in three repetitions, spaced three or more days apart, to test their
repeatability.

The Rainbow Script contains 12? syllables, the Monosyllabic Script 87
syllables, and the ARPA Sentences 171 syllables (71 in 6ARPA, 100 in 7ARPA).
With three repetitions with speech, three without speech, three listeners,
and the various speakers involved, this totals about 13,000 judgments
of stress levels for syllables in connected texts. In the following sections
we will try to summarize these extensive results. Section 4.2 presents the
majority judgments of the panel of listeners about the stress levels of all
syllables in the texts. The differences between the perceptions of each
listener and those of the other listeners will be explored in section 4.3.
The differences from one repetition of the experiment to another will be
presented in section 4.4. In section 4.5, stress perceptions from speech
recordings are contrasted with stress judgments given only the written text,
and implications about the English speaker-listener's rules for linguistic
stress assignment are considered. Some effects of sentence type (yes-no
question, WH question, command, declarative, etc.) on stress perceptions
will be discussed in section 4.6. A summary of conclusions from these
stress-perception studies will be given in section 4.7.

4.2 Majority Judgments of Stress Levels 

To provide a single overall decision about the stress level of each
syllable in each of the texts, majority votes had to be obtained. First,
for each listener, his majority vote as to the stress level of each syllable
was found by comparing his three repetitions of the listening test with
each text. (These judgments of the individual listener will be explored in
more detail later.) Then the results for all three listeners were pooled,
as shown in the plots of Figure 2 (and Figures B-2 to B-15 in Appendix B).
Plotted in Figure 2, for each syllable in the Rainbow Script read by ASH,
is the number of listeners whose majority vote says the syllable is
stressed, minus the number of majority judgments characterizing the syllable
as reduced. Unstressed judgments were assigned values of zero. Thus, if all
three listeners heard a syllable as stressed (on their majority decisions
from three trials), a value of +3 was plotted; if two listeners gave majority
votes of reduced for a syllable, and the other listener perceived it as
unstressed, a value of -2 (minus two) resulted. Occasionally (actually, very
rarely), one listener's judgment of reduced cancelled another's judgment of
the syllable as stressed. These cases of opposing judgments are marked on
Figure 2 (and Figures B-2 to B-15) by double-ended arrows below the
corresponding syllable of the text.

The syllables which were most definitely stressed (i.e., perceived by
all listeners as stressed) thus were at the top of the scale; those
definitely perceived as reduced were at the bottom of the scale. From such
results, one can readily see which syllables are unanimously judged as
stressed, which are judged as stressed by a majority of the listeners, etc.
When syllables such as long, round, path in the second sentence shown in
Figure 2 are unanimously judged as stressed, one can be more confident that
acoustic cues to stress are to be found. In section 5, we shall assume that
all syllables which had an overall stress score of +2 or +3 are stressed,
and should be found by the algorithm for stressed syllable location.

From Figures 2 and B-2 to B-15 (in Appendix B), one can observe that
about 40% of all syllables were judged as stressed (stress score (SS) of +2
or +3) by the panel of listeners. About 25% were judged unstressed (SS = +1,
0, or -1), and about 35% were judged reduced (SS = -2 or -3).

Thus, if one were to analyze only the stressed syllables, as suggested
in section 1, the distinctive-features analysis could be avoided in the 60%
of syllables that are unstressed or reduced, where distinctive-features
analysis is presumably more difficult and unreliable.
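
The scoring procedure of this section can be sketched in a few lines of
Python. This is only an illustrative sketch under stated assumptions: the
report does not describe how the tabulation was actually carried out, the
function names are hypothetical, and the handling of a three-way split
among one listener's trials is an assumption (the report does not say how
such ties were resolved).

```python
from collections import Counter


def majority_level(trials):
    """Return the level one listener chose in most of his trials.

    `trials` is a list such as ["stressed", "stressed", "unstressed"].
    A 1-1-1 split has no majority; falling back to "unstressed" in
    that case is an assumption of this sketch, not the report's rule.
    """
    counts = Counter(trials)
    top_level, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:
        return "unstressed"
    return top_level


def stress_score(listener_trials):
    """Stress score (SS) for one syllable.

    `listener_trials` holds one list of trial judgments per listener.
    SS = (listeners whose majority is stressed)
         - (listeners whose majority is reduced);
    unstressed majorities count as zero.
    """
    majorities = [majority_level(trials) for trials in listener_trials]
    return majorities.count("stressed") - majorities.count("reduced")


def panel_level(score):
    """Map a three-listener stress score onto the panel's category."""
    if score >= 2:
        return "stressed"
    if score <= -2:
        return "reduced"
    return "unstressed"
```

With three listeners the score ranges from -3 to +3. A syllable that two
listeners hear as stressed and the third hears as reduced receives +1,
and so falls in the unstressed band rather than the stressed one.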

4.3 Effects of the Individual Listener on Stress Perceptions

It is obvious from the plots of stress scores in Figure 2 and Figures
B-2 to B-15 that listeners often differ in their judgments of stress levels.
Here we consider those differences in some detail.

Suppose we first consider the syllable-by-syllable differences in majority
stress judgments between the listeners. (We consider here the majority
decisions from three trials by one listener, compared to corresponding majority
judgments from three trials by another listener.) Every time a syllable is
called stressed by one listener and unstressed by another, we have what we
might call a listener-to-listener "confusion" in stress levels. Similarly,
one listener's judgment as reduced and another listener's judgment as
unstressed (or even stressed) represents a confusion. All of these differences
in assigned stress levels can be summarized in confusion matrices, such as
previously illustrated by Lea, Medress, and Skinner (1972a,b). With so many
texts, talkers, and listeners, the number of confusion matrices is extremely
large (but they are available for those who may be interested in studying
them). The primary conclusions are, however, summarized in the plots of
Figures 3 and 4.
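
A listener-to-listener confusion matrix of the kind described above, and
the percentages derived from it, might be tabulated as in the following
Python sketch. The function names are hypothetical and the data layout
(one majority judgment per syllable for each listener) is an assumption
made for illustration; the sketch is not the tabulation actually used in
the study.

```python
LEVELS = ("stressed", "unstressed", "reduced")


def confusion_matrix(judgments_a, judgments_b):
    """Count syllables by (listener A level, listener B level).

    Rows index listener A's judgment, columns listener B's, both in
    the order given by LEVELS. Agreements fall on the main diagonal.
    """
    if len(judgments_a) != len(judgments_b):
        raise ValueError("both listeners must judge the same syllables")
    index = {level: i for i, level in enumerate(LEVELS)}
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for a, b in zip(judgments_a, judgments_b):
        matrix[index[a]][index[b]] += 1
    return matrix


def percent_confused(matrix):
    """Percentage of syllables lying off the main diagonal."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - agreed) / total


def confusion_breakdown(matrix):
    """Split the confusions into the three kinds of level pairs."""
    total = sum(sum(row) for row in matrix)

    def pct(i, j):
        # A confusion counts regardless of which listener gave which level.
        return 100.0 * (matrix[i][j] + matrix[j][i]) / total

    return {
        "stressed-unstressed": pct(0, 1),
        "unstressed-reduced": pct(1, 2),
        "stressed-reduced": pct(0, 2),
    }
```

The overall percentage corresponds to the quantity plotted in Figure 3,
and the three-way breakdown to the separate bars of Figure 4.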

Figure 3 shows the percentages of all stress level judgments that differ
from one listener to another, plotted for each text and talker and for both
conditions of SPEECH (listeners hearing the speech recordings) and NO SPEECH
(individuals judging the stress levels from the written text only). There is
little variation in the percentage of confusions between listeners for different
texts and talkers, and speech versus no-speech conditions. However, there is a
prominent effect due to the individual listener. Listeners WAL and MFM show
different judgments for about 20 to 30% of all syllables. These confusions
are considerably fewer than those between listeners WAL and TES (30 to 55%) and
between MFM and TES (about 4? to 60%). It is apparent that listener TES gives
results that are markedly different from those of the other two listeners.
Listeners WAL and MFM are more alike.

Figure 4 illustrates an even more serious way in which listener TES
differs from listeners WAL and MFM. Confusions (from listener to listener)
between stressed and unstressed syllables are separated from those between
unstressed and reduced syllables. The white bars show the percentages of
unstressed-reduced confusions for each text, talker, and condition. The
cross-hatched bars show corresponding percentages of confusions between
stressed and unstressed levels. The extreme confusions between stressed
and reduced are shown by dark bars. Listener TES actually labelled as
reduced some syllables which the other listeners called stressed.

Figure 3. Percentages of Stress Judgments that Differ from One Listener to
Another, for Each Speech Text, and with Each Speaker and the NO SPEECH
Conditions. Plotted are percentages of confusions between listeners WAL versus
MFM, WAL versus TES, and MFM versus TES.

The important fact shown by Figure 4 is that only about 2 to 8% of all
syllables were judged as stressed by listener WAL and unstressed by MFM, or
vice versa, while stressed-unstressed confusions were much more frequent
between listener TES and either of the other two listeners (17 to 31% for
WAL vs TES, and about 18% to 28% for MFM vs TES). This is critical since
listeners' judgments of stressed syllables will be used to evaluate the
algorithm for stressed syllable location.

These frequent differences in assignment of stressedness to syllables,
and the occasional extreme confusions between stressed and reduced syllables,
suggest that one must be very careful how he pools the results for listener
TES with those for listeners WAL and MFM. Our procedure for overall stress
assignment, adding the stress scores for each listener, yields a result
which assigns stress to a syllable (for comparison with the location algorithm)
whenever WAL and MFM agree that it is stressed, except in the extreme case
where TES calls that same syllable reduced. (TES never called a syllable
stressed when either of the other listeners didn't call it stressed.)

These differences between listener TES and other listeners were observed
early in our stress perception studies (Lea, Medress, and Skinner, 1972b).
Listeners WAL and MFM also were found to yield results quite similar to those
of four other listeners used in previous studies at Purdue University, while
TES gave quite different results. However, for consistency, the experiments
were continued maintaining the same three listeners throughout.

A reasonable conclusion might be to reject listener TES. Yet, one might
argue that it is conceivable that TES is actually giving the judgments
closest to the "true" stress levels of syllables, and the other listeners are
wrong and should be rejected. Lacking any way of deciding "true" stress
levels, how does one decide the issue? After all, as has been pointed out in
previous reports (Lea, Medress, and Skinner, 1972a,b), listener TES is much
more demanding about the characteristics of a stressed syllable. His strategy
of stress classification demanded that a syllable be very prominent before it
was classified as stressed. Such syllables will presumably have the most
marked acoustic correlates of high energy, high and rising F0, and long
durations. Thus, an algorithm for stressed syllable location should be more
successful in finding the smaller number of syllables that he categorizes as
"stressed" than in finding all those categorized as stressed by less
demanding listeners. It is then easier to get high "hit" rates in stressed
syllable locations using TES's judgments. We have chosen to take the more
challenging goal of finding all syllables that were judged stressed by a
majority of the listeners.

In section 4.4, evidence will be given that does suggest that listener
TES be rejected, not just because of his differences from other listeners,
but also because listener TES is not as consistent from repetition to
repetition of the experiment.

It may be useful to "screen" listeners for future experiments, to
determine their consistency from repetition to repetition and their general
similarity to other listeners. The stability of results shown in Figures
3 and 4 regardless of text or talker suggests that such screening might be
done with a minimum amount of speech, such as one or two talkers reading
one or two short texts.

4.4 Consistency of Perceptions From Time to Time

Stress perceptions were obtained from several trials by each listener,
for each text and talker, to establish listener consistency from time to
time. Thus, for example, listener WAL might listen once to talker ASH
reading the Rainbow Script, then listen to the same tape again several
(three or more) days later, then listen a third time after another few
days. Periods between trials varied from as few as three days to as long
as six or seven months. Results were reasonably consistent regardless of
the period between trials, provided that the period was one week or more.
For some trials separated by only a few days, the listeners reported that
they could remember some of their previous assignments. Future studies
should require a minimum of one week between trials.

Figure 5 illustrates the percentages of all judgments that differ from
one trial to another. This is compiled for each text and talker, and for
the NO SPEECH conditions, using the following procedure. For a given recording,
the perceptions on trial A are compared to those for trial B. For each syllable
on which they differ (such as the syllable air being judged stressed on one
trial and unstressed on the other), one confusion would be shown off the main
diagonal of a confusion matrix. The number of syllables whose two trial
judgments differ (yielding off-diagonal instances in the trial A versus trial B
confusion matrix), divided by the number of syllables in the text, gives the
percentage of syllables confused from trial A to trial B. This is repeated for
trial B versus trial C, and for trial A versus trial C. The average of these
three percentages of (off-diagonal) confusions is the value plotted for each
text and talker in Figure 5. Results are shown separately for each listener's
confusions from trial to trial. There is thus a very large amount of confusion
data summarized in Figure 5.
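
The trial-to-trial measure plotted in Figure 5 can be written out as the
short Python sketch below. The function names are hypothetical, and each
trial is assumed (for illustration) to be a list holding one stress label
per syllable of the text.

```python
from itertools import combinations


def percent_differing(trial_x, trial_y):
    """Percentage of syllables judged differently on two trials."""
    if len(trial_x) != len(trial_y):
        raise ValueError("both trials must cover the same syllables")
    differing = sum(1 for x, y in zip(trial_x, trial_y) if x != y)
    return 100.0 * differing / len(trial_x)


def mean_trial_confusion(trials):
    """Average the pairwise percentages over all pairs of trials.

    For three trials A, B, C this averages A vs B, A vs C, and B vs C,
    which is the value described for each text and talker in Figure 5.
    """
    pairs = list(combinations(trials, 2))
    return sum(percent_differing(x, y) for x, y in pairs) / len(pairs)
```

For example, if trials A and B differ on one syllable in four, A and C on
one in four, and B and C on two in four, the plotted value would be the
mean of 25%, 25%, and 50%, or about 33%.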

The results show that listeners are fairly consistent from trial to
trial, regardless of text or talker. That is, less than 21% of all judgments
vary from trial to trial. For the Rainbow Script and the 7ARPA Sentences,
results are quite similar from listener to listener and from talker to talker,
or even from talker to NO SPEECH conditions. However, listener TES yielded
considerably more trial-to-trial confusions than listeners WAL and MFM for the
Monosyllabic Script, where his more frequent stressed-unstressed confusions
were undoubtedly affected by the many stressed syllables occurring in texts
of monosyllabic words. Trial-to-trial confusions were particularly numerous
in the 6ARPA Sentences, especially for NO SPEECH conditions. We shall see
later that this was in part due to the questions and unusual pauses and F0
variations involved in these spontaneous utterances.

Figure 6 presents a breakdown of repetition-to-repetition confusions
into those between stressed and unstressed, unstressed and reduced, and stressed
and reduced, for each listener. As in Figure 4, where listener-to-listener
confusions were plotted, it is apparent that listeners WAL and MFM showed few
(1 to 9%) stressed-unstressed (and no stressed-reduced) confusions, while
listener TES gave many more confusions (7% to 14%) from trial to trial.
In fact, listener TES produced more stressed-unstressed confusions than
unstressed-reduced ones. Since the primary intent of the stress perception
studies is to provide stress standards by which a stressed-syllable locator
may be judged, such confusions about stressed syllables are crucial. The
lack of repeatability in stress judgments, when coupled with the other
unusual characteristics of TES judgments, would seem to be unacceptable in
future studies of stressed syllables.

We shall see in section 5 that the stressed syllable location algorithm
locates about 85% of all syllables perceived as stressed by the majority of
listeners. It thus misses about 15% of the stressed syllables, and it
labels about 15% of the syllables as stressed even though they were not
perceived as stressed by two or more listeners. When the perception "standard"
whereby the algorithm is judged varies from time to time by the same order of
magnitude as the differences between the perceptions and the acoustically
derived decisions, it can hardly be called a "standard" any more. We desire
that the past results with the standard accurately predict the next results
when applying that standard again to the measurement of the same data. We
thus must reject TES data for providing an evaluation of stressed syllable
location to any closer than 10 or 15% or so.
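
The comparison between an algorithm's output and the perceptual standard
can be made concrete with the following Python sketch. It is only an
illustration: the function name and the representation of syllables by
integer positions are hypothetical, and expressing the false detections
as a share of the syllables the algorithm labels as stressed is an
assumption, since the text does not state which denominator was used.

```python
def evaluate_locator(perceived_stressed, located_stressed):
    """Compare located stressed syllables with the perceptual standard.

    Both arguments are collections of syllable positions. Returns the
    percentage of perceived-stressed syllables that were located (hits)
    or missed, and the percentage of located syllables that listeners
    did not perceive as stressed (false detections; the denominator
    chosen here is an assumption of this sketch).
    """
    perceived = set(perceived_stressed)
    located = set(located_stressed)
    if not perceived or not located:
        raise ValueError("both sets must be non-empty")
    hits = perceived & located
    misses = perceived - located
    false_detections = located - perceived
    return {
        "hit_rate": 100.0 * len(hits) / len(perceived),
        "miss_rate": 100.0 * len(misses) / len(perceived),
        "false_rate": 100.0 * len(false_detections) / len(located),
    }
```

If the standard itself changes by 10 to 15% between repetitions, the hit
and miss rates computed against it are uncertain by a comparable amount,
which is the point made above about listener TES.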

Even with the perceptions of listeners WAL and MFM, we must realize that
the confusions of about 5% or so from time to time suggest we cannot judge
the effectiveness of stressed syllable location to any precision greater than
about 5%. If a stressed syllable algorithm locates 95% of all syllables
perceived as stressed by majority votes of two or more listeners, it is doing
no worse than one repetition of the perceptions would do for predicting the
perceptions from a second repetition of the experiment. We thus have no
motivation to attain 98% "correct" location of stressed syllables versus 95%,
etc., as long as we use the present form of listener perceptions as the
standard.

One might speculate that a new procedure for obtaining listener
judgments of stress levels, such as allowing a scale of 1 to 10, or an
assignment of any arbitrary number to each syllable, might conceivably
yield improved (more stable) perception results. However, it is doubtful
that increasing the number of levels into which stress is categorized would
actually improve the stability of results. A confusion of level 6 and 7
(on a 10-level scale) from repetition to repetition is still a confusion
even though it may be said to be a finer-grained, or smaller, confusion
than a stressed-unstressed confusion. One might try to define metrics
for measuring the size of such confusions, and try to suggest that the
overall confusion is decreased in some sense. However, it is important
to realize that an experiment so defined does not define an interval
measurement scale, in the measurement-theoretic sense (Stevens, 1951;
1969; Lea, 1971), and no such metrics would be justified in terms of the
abstract structure of the perceptual scale. The present experiments
define an ordinal measurement-theoretic scale, which distinguishes three
nominal classes (stressed, unstressed, reduced) with an ordering (stressed
is "greater" than unstressed, unstressed is "greater" than reduced), but
no defined intervals (we have not required or demonstrated that the
"distance" or difference from stressed to unstressed is equal to that
from unstressed to reduced, etc.).
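
The practical consequence of having only an ordinal scale can be shown
with a small Python sketch. Both numeric codings below are hypothetical
and are introduced purely for illustration; each preserves the ordering
reduced < unstressed < stressed, yet the two disagree about which of two
sets of confusions is "larger".

```python
def total_confusion_size(confusions, coding):
    """Sum of numeric 'distances' between confused levels.

    `confusions` is a list of (level, level) pairs; `coding` maps each
    level to a number. The result depends on the coding chosen, which
    is exactly the difficulty with an ordinal scale.
    """
    return sum(abs(coding[a] - coding[b]) for a, b in confusions)


# Two order-preserving codings (both hypothetical).
equal_steps = {"reduced": 0, "unstressed": 1, "stressed": 2}
unequal_steps = {"reduced": 0, "unstressed": 1, "stressed": 10}

# Set P: two unstressed-reduced confusions.
set_p = [("unstressed", "reduced"), ("reduced", "unstressed")]
# Set Q: a single stressed-unstressed confusion.
set_q = [("stressed", "unstressed")]

# Under equal steps, P looks like the larger amount of confusion ...
p_equal = total_confusion_size(set_p, equal_steps)
q_equal = total_confusion_size(set_q, equal_steps)
# ... but under the unequal coding the comparison reverses.
p_unequal = total_confusion_size(set_p, unequal_steps)
q_unequal = total_confusion_size(set_q, unequal_steps)
```

Since nothing in the experiment favors one coding over the other, a
"size of confusion" metric of this kind has no basis in the data; only
counts of confusions between ordered categories are meaningful.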

Since confusions do occur from repetition to repetition of the stress
perception experiment, majority votes from three or more trials would seem
to be suitable for obtaining somewhat more stable results. The majority
votes from three trials are expected to be more like those majorities
from three more subsequent trials than the single trial-to-trial judgments
would be.

4.5 Comparing Stress Judgments With Speech to Those Without Speech

The general consistency with which most listeners can assign stress
levels to syllables in connected speech (Li, Hughes, and Snow, 1972; Lea,
Medress, and Skinner, 1972a) suggests that there is indeed some psychological
reality to the concept of stress. The fact that listeners assign
approximately the same stress patterns to the speech of different talkers
reading the same text suggests that either (a) the talkers are all
consistently conveying something that we might call the normative, unmarked
pattern of linguistic stress for that structure and content of the text,
or (b) the listeners are assigning stress levels not so much on the basis
of this stable input acoustic data, but rather on the basis of their own
internalized theories of stress or their projection of how they would
have said the same text.

Some evidence is already available to discount the idea that the
acoustic data plays absolutely no role in stress perceptions. Previous
research on acoustic correlates of stress has shown that listeners do
change their stress judgments as acoustic parameters are varied under
various controlled conditions (cf., e.g., Lieberman, 1960; 1967; Lehiste,
1970). The data in the present experiments (see Figures B-2 to B-15 in
Appendix B) show some differences from talker to talker, for the same
text, which are consistently shown in the assigned stress levels of all
listeners. The listener is indeed making his stress judgments based at
least in part on the acoustic data, and not simply on the basis of
expected patterns. It would appear that talkers are generally assigning
equivalent stress patterns to the texts they speak, presumably following
an unmarked "linguistic" stress pattern for the content
and structure of the sentence, but that the individual talker will
occasionally deviate from a strict pattern, perhaps assigning added
emphasis to certain words, or reducing certain syllables one time whereas
he (or someone else) may not do quite the same thing the next time he spoke
the same text.

If the talkers did in fact approximate to, but not always exactly
attain to, a standard linguistic stress pattern, and if the listeners used
their own internal notions of stress and the acoustic data to assign stress
levels, but were not perfectly consistent in setting the boundaries between
the categories of stressed, unstressed, and reduced, we might expect all
the general consistencies and minor inconsistencies that have been found
in perceptions of various listeners from repetition to repetition, talker
to talker, and text to text. We might expect somewhat more agreement about
stress in read speech than in spontaneous utterances if the listener's
a priori way of assigning stress agreed more with the unmarked pattern
expected in reading texts than with the possible special emphasis,
reductions, pausing and restarting and other variations that are introduced by
spontaneous speech. If the listener were making stress judgments entirely
on the basis of acoustic data, and had no added difficulty in making
acoustic distinctions for spontaneous speech, we would expect his judgments
for spontaneous and read speech to be equally consistent.

Further stress studies are needed to answer many questions about how
listeners perceive stress, how their own internal models interact with the
acoustic data, whether there is a consistent normative or unmarked stress
pattern used by both talkers and listeners, how spontaneous speech might
differ from read speech in spoken and perceived stress patterns, etc.



Included in the present studies were experiments on stress judgments
given only the written text, which were to be compared with the same
person's stress perceptions using the speech recordings. These NO SPEECH
judgments have been included in the summaries of Figures 3 to 6 with no
previous attempt to contrast them with the results with speech. Here we
specifically explore the differences and similarities that result.

The listener-to-listener confusions in stress levels, as shown in
Figure 3 (and in Figure 4), show no marked differences between numbers of
confused perceptions with speech recordings and numbers of confused
judgments without speech. We might have expected that if the listener's
own stress theory (or his own way of assigning stress to the text if he
were to read it) were playing an active, dominant role in stress
assignment from the written text alone, and if his theory played much less of a
role in listening to the speech recordings, and if the internal theories
of the listeners differed much, then the listener-to-listener confusions
without speech should be substantially more than those with the equalizing
effect of the acoustic data. But, in fact, there is no significant difference
between the percentages of listener-to-listener confusions with versus without
speech. Thus, either the listeners are each assigning substantially the same
stress patterns whether the speech is present or not (and thus some internal
theory is playing a dominant role under both conditions) or else they all
vary in similar manners in how they change no-speech judgments to perceptions
with speech.

Suppose one could show that stress judgments exhibit many more confusions
from repetition to repetition when only the written text is given, when
compared to the perception confusions from repetition to repetition with
speech. Then he could argue that the present stress perception experiments
using speech recordings are more useful than just having native English
subjects predict stress patterns from the written text. He could also argue
that this is evidence that the acoustic data were critical in obtaining
reliable stress assignments. Surprisingly, this did not turn out to be true
in the present experiments. Figures 5 and 6 show that, with the possible
exception of the results for the 6ARPA Sentences, the number of repetition-
to-repetition confusions without speech is not significantly larger than the
number of confusions with the speech.

A related issue is whether the stress judgments without speech agree
substantially with the perceptions with speech. That is, can one accurately
predict the listener's perceptions with speech from his judgments without
speech (or vice versa)? While judgments without speech may be consistent
from time to time, and while numbers of listener-to-listener confusions
may be comparable with or without speech, the syllable-by-syllable judgments
without speech may or may not correspond with those with speech. Figure 7
illustrates the results of comparing the majority decisions (for three
trials) of each listener with speech to his majority decisions without
speech. Plotted are the percentages of all syllables in the texts that are
assigned different stress levels with speech from those assigned without
speech, for each listener.

Figure 7. Percentage of Confusions in Assigned Stress Levels for NO-SPEECH
versus SPEECH Conditions, for Each Text and Talker. Plotted are percentages of
confusions for listener WAL, listener MFM, and listener TES.

It is evident from comparing the results of Figures 7 and 5 that, for
the Rainbow Script, comparable percentages of confusions occur for repetition-
to-repetition comparisons (Figure 5) and speech-to-no-speech comparisons (Figure
7). That is, the NO-SPEECH stress judgments do as good a job of predicting
stress perceptions with speech as one repetition with speech would do in
predicting the results of another repetition with speech, for the Rainbow
Script. For listener MFM, the majority NO SPEECH judgments for most texts are
more like the majority judgments with speech than one repetition with speech
is like the next repetition with speech. On the other hand, listeners WAL
and TES usually show more confusions between SPEECH and NO SPEECH than between
repetitions with speech, particularly for the Monosyllabic Script and the ARPA
Sentences. (The probable reason these listeners did not show more SPEECH vs
NO SPEECH confusions for the Rainbow Script is that for that text, the
NO-SPEECH judgments were done after the listeners had done three tests with the
speech, and discussed the results, so their NO-SPEECH judgments could have
been biased by previous experience with the speech. For the Monosyllabic
Script, the NO-SPEECH judgments were obtained before any test with the speech.
For the ARPA Sentences, some NO-SPEECH tests were performed before, and some
after, the tests with speech. Data analyses for those texts were done after
all experiments had been performed.)

The vast difference in SPEECH vs NO SPEECH confusions for the Monosyllabic
Script might suggest that listeners vary in their relative success of assigning
lexical versus structure-dictated aspects of stress. Listener MFM shows very
few (2% or 3%) confusions for the Monosyllabic Script, perhaps indicating
that he can assign sentence stress very consistently. His higher rates of
confusion (6% to 18%) with other texts (notably, the 6ARPA Sentences) suggest
that he has more difficulty when lexical stress factors of polysyllabic words
also are involved. Listener TES, on the other hand, is quite inconsistent in
assigning stress to the Monosyllabic Script, perhaps suggesting more difficulty
with sentence structure aspects of stress assignment. An equally revealing
and perhaps more plausible explanation is that the Monosyllabic Script has a
higher percentage of stressed syllables than the other texts. We have already
seen that MFM has considerably fewer stressed-unstressed confusions than
unstressed-reduced, while TES confuses considerably more stressed and
unstressed syllables.

The relatively high confusion rates shown in Figure 7 for the 6ARPA
Sentences (and in Figure 5 for NO SPEECH confusions from repetition to
repetition) suggest that we cannot rely on stress judgments using only
the written text to give the best predictions of perceived stress levels
for spontaneous utterances suitable for man-machine interactions. Thus,
while stress judgments without speech recordings may do a surprisingly
good job of predicting perceived stress patterns for normal speech read
from texts, they are not the best form of stress judgments for spontaneous
speech. Stress location algorithms to be used in speech understanding
systems should be judged by stress perceptions obtained from speech
recordings, not from judgments about orthographic transcriptions.

4.6 Effects of Sentence Type on Stress Judgments

In compiling the confusions for the various texts, it was evident not
only that confusions were more common in the 6ARPA Sentences, but that
questions seemed to exhibit more confusions than declaratives or commands.
In Figure 8, the thirteen ARPA sentences are separately listed, along with
symbols that indicate the basic category to which that sentence structure
belongs (yes-no question, Y/N?; question with interrogative (WH) word, WH?;
command, C; polite command, PC; and declarative, D). Plotted for each
sentence is the percent of all syllable stress level comparisons that
differed from repetition to repetition, for the three trials with speech
(Figure 8a) or without speech (Figure 8b), or the percent of all syllables
that differed between the majority vote with speech and the majority vote
without speech (Figure 8c). Results were pooled for all listeners, by
first finding the plot for each individual listener, then averaging the
values for all three listeners for each sentence.
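
The pooling step just described (one value per listener per sentence,
then an average across listeners) amounts to the following Python
sketch. The function name and the dictionary layout are hypothetical,
and the numbers used in any call would have to come from the individual
listener plots, which are not reproduced here.

```python
def pool_by_sentence(per_listener):
    """Average each sentence's confusion percentage across listeners.

    `per_listener` maps a listener's initials to a list of percentages,
    one entry per sentence, with every list in the same sentence order.
    """
    listeners = list(per_listener)
    n_sentences = len(per_listener[listeners[0]])
    if any(len(per_listener[name]) != n_sentences for name in listeners):
        raise ValueError("every listener needs one value per sentence")
    return [
        sum(per_listener[name][i] for name in listeners) / len(listeners)
        for i in range(n_sentences)
    ]
```

Averaging listener by listener in this way gives each listener equal
weight for every sentence, whatever his individual rate of confusion.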


Several points may be made from these results. First, percentages of
confusions tend to be higher for the 6ARPA Sentences than for the 7ARPA
Sentences in Figure 8b and c, presumably because the NO SPEECH judgments
for the 6ARPA Sentences were obtained before the listeners heard the speech,
while, for the 7ARPA Sentences, some NO SPEECH judgments were made after the
perceptions with the speech had been obtained. The WITH SPEECH perceptions
seemed to have been remembered, to help stabilize stress decisions in later
trials.

Of more importance here are the high confusion percentages that occur 
in Figure 8a, 8b, and 8c for Yes-No questions 3, 5, and 8, and for WH 
question 7 in 8c. In general, questions (especially yes-no questions) seem 
to exhibit more confusions than declaratives and most commands. 

Another way in which stress perceptions are significantly influenced
by the sentence type is in terms of how much the different listeners varied
from each other in their consistency of stress assignment for each sentence.
Listeners differed by [illegible] in the percentage of confusions which they
exhibited from NO SPEECH to SPEECH for questions 7 and 8, 2[?]% for yes-no
question 3, and 20% for yes-no question 1, but by less than an average of 13%
for all of the other ARPA sentences. Similarly, in repetition-to-repetition
confusions without speech, the greatest variations in rate of confusions
occurred for yes-no questions 3 and 8, and WH questions 1 and 7 (as much as
30% compared to an average of 11% for the other ARPA Sentences).

From these preliminary results, it appears that stress assignment is more
difficult in questions than in other sentence structures. Further, more
controlled tests with various sentence types would be needed to confirm these
apparent trends obtained from only 13 sentences. These tests will be undertaken
using the extensive set of sentences presently being designed for isolating
various factors affecting prosodic patterns (cf. Lea, Medress, and Skinner,
1972a, pp. 56-7).

4.7 General Conclusions About Stress Perceptions

The above extensive analyses of stress assignments by three listeners
have yielded the following general conclusions:

1. Different listeners assign different stress levels to the same
syllables, presumably based on how they individually define the
boundaries between categories of stressed, unstressed, and
reduced syllables. Their confusions are not seriously increased
or decreased in going from individual talker to talker, or from
text to text (except when questions are introduced; see point 8
below).

2. Listeners WAL and MFM, who have been shown by previous experiments
to yield stress perceptions very much like those of other
listeners, differed in as much as 25 to 30% of their majority
decisions about stress levels (compiled from three trials).
However, only about 5% of all syllables were confused between
the categories stressed and unstressed. Thus, judgments of which
syllables were stressed agreed very well between listener
WAL and listener MFM.

3. Listener TES differed from the other two listeners on about half
of his stress decisions. About 20 to 25% of all syllables were
labelled stressed by other listeners, but unstressed by TES. He
even labelled as reduced some syllables labelled stressed
by the other listeners. Also, listener TES labelled substantial
percentages (as much as 15%) of all syllables as stressed on one
trial and unstressed on another. Future studies should incorporate
a procedure for rejecting such listeners who provide
inconsistent judgments about stressed syllables.

4. From repetition to repetition of the perception tests, listeners
WAL and MFM individually showed quite stable judgments as to which
syllables were stressed. An average of 5% of all syllables were
confused between stressed on one trial and unstressed on another
trial. They thus provide a reasonably stable "standard" as to
which syllables are stressed, for comparison with algorithm results.

5. Majority votes obtained from 3 or more trials should be used to
partially obliterate the 5% deviations in assignment of stressed
syllables from trial to trial. No stressed syllable location
algorithm need find more than 95% of all syllables perceived as
stressed, since it can hardly be more "accurate" than one
perception trial is in predicting the perceptions to be attained
on another trial.

6. Since listeners agreed in many of the differences they assigned 
to the stress patterns of different talkers reading the same text, 
the acoustic data appears to play at least some role in stress 
perceptions. However, since listener-to-listener confusions and 
most repetition-to-repetition confusions were not significantly 
increased when only the written text was used, it appears that the
listeners also make use of a reasonably stable internal theory for 
stress assignment. 

7. When listeners had not done the perception tests with speech before 
they did the stress assignments from the written text alone (as 
with the Monosyllabic Script and the ARPA Sentences), their majority 
judgments without speech differed more from their majority perceptions
with speech than the repetition-to-repetition judgments with speech (or
without speech) had differed. Thus, while stress judgments without
speech are as consistent from listener-to-listener and from
repetition-to-repetition as are perceptions with speech, the judgments
made without speech are significantly different from those made with
speech. In particular, perceived stress patterns for spontaneous
utterances are not reliably obtained from judgments based only on
the written text.

8. Questions (especially yes-no questions) appear to yield more confusions
in stress levels (from repetition-to-repetition) than other sentence
structures (declaratives or commands), and show greater variability
from listener-to-listener.

In summary, the stress perceptions obtained from the trials with speech,
by using majority decisions for each listener, and pooling results for the
listeners by the sum (-3 to +3) plots as shown in Figure 1, provide a
"standard" of stress assignment which is stable to within about 5%. This
thus permits comparisons (to within 5%) between perceived stressed syllables
and stressed syllables located by algorithm from the acoustic data.
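
The majority decision and the pooled (-3 to +3) sum can be sketched as follows. The numeric coding (stressed = +1, unstressed = 0, reduced = -1) is an assumption, chosen because it reproduces the score of +1 that section 5.4 reports for a syllable heard as stressed by two listeners and reduced by the third. The treatment of a three-way tie as 0 and all names are hypothetical.

```python
from collections import Counter

# Assumed numeric coding of the three perceptual categories.
LEVEL_VALUE = {"stressed": 1, "unstressed": 0, "reduced": -1}


def majority(labels):
    """Most common label among one listener's trials (None if tied)."""
    (top, count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == count:
        return None
    return top


def stress_score(per_listener_trials):
    """Sum of the listeners' majority decisions for one syllable.

    With three listeners the result lies between -3 and +3.
    A tied listener contributes 0 (an assumption of this sketch).
    """
    score = 0
    for trials in per_listener_trials:
        score += LEVEL_VALUE.get(majority(trials), 0)
    return score
```

Under this coding, a syllable counts as "perceived as stressed" when its score is +2 or +3, that is, when at least two of the three listeners' majority decisions are "stressed".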
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5. .STRESSED SYLLABLE LOCATION FROM ACOUSTIC DATA 



5.1 Correlates of Stress in F0 Contours

Stress is an abstract quantity usually considered to be associated with
a speaker's total physical effort in speech production or with a listener's
perception of "prominent" syllables. Having obtained extensive data on
listeners' perceptions of stressed, unstressed, and reduced syllables
(section 4), we shall now consider how the perceptions relate to acoustic
data.

Extensive work has been done on acoustic correlates of stress (cf. reviews
by Lehiste, 1970; Medress and [...]) [...]
of stressed syllable production (cf. review by Lieberman, 1967). Many studies
have taken advantage of the ability to separately control acoustic features of
synthesized speech, to test how acoustic variations correlate with listeners'
perceptions of stress (Fry, 1955, 1958; Bolinger, 1958; Morton and Jassem, 1965;
Mattingly, 1966; etc.). Most experimental studies have been concerned with
stress in isolated words, short phrases, or short isolated sentences.

For reasons detailed previously (Lea, Medress, and Skinner, 1972a), the
stress perception studies reported in section 4, and the studies of acoustic
data reported in this section, are concerned with stress patterns in
semantically-connected texts and computer instructions or queries, spoken by
several different native English speakers. Acoustic correlates of stress that
will be incorporated into the stressed syllable location algorithm are (1)
fundamental frequency (F0) variations (particularly local increases in
fundamental frequency) and (2) the energy integral within the syllable
(incorporating both amplitude and duration measures into one measurement).

While many studies have shown that higher F0 is associated with stressed
syllables (Bolinger, 1958; Fry, 1958; Lieberman, 1960; Morton and Jassem, 1965;
Lehiste, 1970; Lea, 1972b), others have shown that even better correspondence
is to be found between local increases (or, occasionally, decreases) in F0 and
stress than is provided by the absolute peak (or mean) values of F0 within
stressed vowels or syllables (Bolinger, 1958; Medress and Skinner, 1971;
Morton and Jassem, 1965). Some studies suggest that it is the presence of
such changes that marks stress, not the specific magnitude of the change
(Fry, 1958; Morton and Jassem, 1965).

Effects of phonetic sequences may interfere with these stress effects
on F0 contours. Vowels articulated with higher tongue position have F0
values that are about 10 to 15% higher than those with low tongue position
(House and Fairbanks, 1952; Lehiste, 1970; Lea, 1972b, 1973a). This is one
reason why absolute values of F0 may fail to mark stress; an unstressed /i/
may have a higher peak or mean F0 than a stressed vowel with low tongue
position. Also, the peak or mean F0 in a vowel is higher when the preceding
consonant is unvoiced than if it is voiced or if no consonant precedes the
vowel (House and Fairbanks, 1952; Lea, 1972b). More important with respect
to the F0 changes associated with stress are the sudden F0 changes that occur
around consonants (cf. Lea, 1972b, Chapters 4 and 5). Fundamental frequency
suddenly drops about 10% within the closure period of voiced obstruents,
suddenly rises again at opening of the closure, and continues to rise (about
15% or more) during the 100 ms after the following vowel onset. For unvoiced
consonants, F0 will cease (sometimes after the 10% dip at closure, since
voicing frequently ceases after closure) and then, when voicing resumes, F0
will start quite high and rapidly fall. These dips and sudden cusps in F0
contours must somehow be distinguished from stress-dictated F0 changes.

Another influence on F0 contours must be considered in establishing
acoustic correlates of stress. Intonation studies (Armstrong and Ward, 1926;
Lieberman, 1967; Lea, 1972b) have shown that, in connected texts and spoken
sentences, F0 will usually reach a maximum near the first stressed syllable
(the so-called "HEAD") of each breath group or clause, and will fall gradually
until the last stressed syllable, after which may occur either the rapid fall
of an utterance-final "Tune I" contour or the rise in F0 at the end of "Tune
II" contours (which mark "incompletion"). Figure 9 illustrates the general
shapes of these basic intonation contours. Obviously, the last stressed
syllable of Tune I contours will not consistently exhibit the rises generally
assumed to accompany stressed syllables. Also, unstressed syllables in the
terminal rise of a Tune II contour will be accompanied by rises that do not
mark stress. On the other hand, these studies suggest that the peak F0 of the
contour will be associated with a stressed syllable.

Figure 9. Tune I and Tune II Intonation Contours

The assumption of the constituent boundary detector is that sentences
consisting of several major grammatical constituents will be broken into several
Tune I- or Tune II-like subcontours, riding on the general tune for a sentence or
clause. Thus, as illustrated in Figure 10, F0 contours in sentences with
several major constituents will have major F0 changes associated with the
constituent structure. We might call these rapidly rising and gradually falling
F0 contours "archetype constituent contours". They resemble Lieberman's
(1967) unmarked and marked breath groups, and Pike's (1945) primary contours
plus precontours, and other contours associated with "sense groups" in the
literature.

We shall build a general hypothesis about F0 correlates of stress based
on archetype contours within constituents. It appears the rising F0 near the
beginning of a constituent is attributable to the first stressed syllable in
the constituent (Lea, Medress, and Skinner, 1972b). An algorithm for stressed
syllable location should thus search in the region of the peak F0 in the
constituent, and the rising F0 region preceding the peak. In fact, it appears
that the F0 rise that marks the beginning of the "constituent" found by the
boundary detector is associated with this following stressed syllable. In a
sense, then, the constituent boundary detector may be said to be detecting
some stressed syllables (but not locating them). If each constituent had
exactly one lexical word with a major-stressed syllable within it (as has been
suggested for deep structures; cf. Chomsky, 1965; Emonds, 1970), we might
expect the present method of constituent detections to be closely associated
with the presence of stressed syllables.

In fact, however, surface structure constituents (both as predicted
syntactically and as found by the boundary detector) sometimes contain more
than one stressed syllable (as in the constituent "into many beautiful colors"
in the Rainbow Script). Based on previous studies showing higher and
rising F0 to be associated with stressed syllables, we might expect that the
extra stressed syllables in the constituent will be accompanied by local
increases in F0, above the general archetype pattern. Since these additional
stressed syllables are assumed to follow the first stressed syllable associated
with the peak F0, any increases in F0 associated with them will be manifested
by bumps (temporary increases in F0) above the archetype falling F0 contours,
as shown in Figure 11.

This general strategy regarding F0 correlates of stress will not detect
all stressed syllables in all of speech. When special emphasis, specific
"marked" semantic attitudes (such as unbelief, distrust, etc.), or other
non-normative, non-neutral expression forms are intended by the speaker, he may
show sudden decreases in F0 on stressed syllables (cf. Pike, 1945; Bolinger,
1958; Lea, Medress, and Skinner, 1972a, pp. 35-6). Also, some constituent
structures do not always show highest F0 on the first stressed syllable in a
constituent, but rather on some later stressed syllables. This will introduce
cases where a stressed syllable is not located by an algorithm based on the
archetype contours.

5.2 Energy-Integral Cues to Stress

Early studies of acoustic correlates of stress showed that vowel durations
were longer, and vowel intensities were higher, in stressed syllables (Fry, 1955,
1958; Lieberman, 1960). Indeed, the earliest works (Saussure, 1915; Jones,
1932) equated high intensity and stressedness. However, later studies showed
that F0 was usually the best of the three individual correlates (Fry, 1958;
Lieberman, 1960; Bolinger, 1958). Then, Lieberman (1960) showed that the energy
values integrated over the vowel or the total syllabic duration gave the best
cue to stressed syllables. Medress and Skinner (1971) found that the
energy-integral (over the vowel) was the strongest cue to stress, most
successfully determining the stressed vowel in multisyllabic words, both in
isolation and
in short sentences.

Figure 11. Increases in F0 Above the Archetype Contour for a Constituent
Are Assumed to be Associated with Stressed Syllables.

The energy integral, which incorporates both durations of integration
and intensities at each point within that period, is affected by phonetic
content of the words, and by the positions within intonation contours of
total structures. Vowels articulated with low tongue positions, such as
/a , a/, are more intense (by as much as 6 db) and longer (by as much as
25%) than those with high tongue positions, such as /i, u/ (House and
Fairbanks, 1952; Lehiste, 1970). Tense vowels are also longer than lax
vowels (Delattre, 1962). Vowels are longer when followed by voiced
consonants than when followed by unvoiced consonants. Vowels in unvoiced
consonantal environments tend to be less intense (House and Fairbanks,
1952). The manner of articulation of following consonants can also affect
the durations of vowels. Finally, word- or phrase-initial vowels tend to
be more intense than word-final, phrase-final or utterance-final ones,
while phrase-final syllables tend to be longer in duration than medial or
initial syllables (Lehiste, 1970; Mattingly, 1966). The phrase-final (or
so-called "prepausal") lengthening of syllables appears to be different
for stressed and unstressed syllables (Oller, 1971).

Morton and Jassem (1965) showed that about 6 db or more is needed
between the intensity levels of syllables to successfully mark stress.
Intensity variations of 3 db or less are insignificant perceptually.
Syllabic nuclei (vowels and prevocalic or postvocalic non-vowel consonants)
are at least 6 db more intense than intersyllabic consonants. Thus,
syllabic segmentation of speech would presumably involve 6 db variations
in intensity.

Based on these various studies of duration and intensity, and their
relationships to stress, a general strategy of stressed syllable location
from energy integrals can be outlined. Within the constituents detected
by the boundary detector, and near the positions of peak F0 and local
increases in F0 above the archetype contour, a search should be made for
periods of high intensity, yielding large energy integrals, bounded by
dips in energy presumed to mark syllabic boundaries. These dips should
be on the order of 6 db. Given several high energy regions in the vicinity
of an F0 increase, one should select the one with highest energy integral and
non-falling or rising F0.

It is conceivable that a number of detailed refinements to this general
strategy could maximize the accuracy of stressed syllable location. Among
such refinements could be adjustments to account for intrinsic F0, intensity,
and duration of various vowels, to account for effects of surrounding
consonants, and to account for positions within total structures and
intonation contours.

5.3 An Algorithm for Stressed Syllable Location

As shown on the example computer listing in Figure 12, the constituent
boundary detection program provides markers ("SYNTB") for the positions of all
detected syntactic boundaries, plus markers ("MAXFO") at the (first) position
(time ITMAX) of maximum F0 in each constituent. These are used as starting
data for the stressed syllable location algorithm. The algorithm for stressed
syllable location, which is detailed in Figure C-1 of Appendix C, proceeds by
first locating the HEAD stressed syllable in the constituent, then finding
any other stressed syllables between the HEAD and the end of the constituent.
Presently, no details are included to normalize for vowel identity, phonetic
context, or position within the total intonation contour (such as
utterance-final, tosalizatlon positions, etc.; cf. Lea, Medress, and Skinner,
1972a, p. 37). In this section we shall sketch some of the main points and
detailed decisions involved in the stressed syllable location algorithm. A
flow chart is given in Figure C-1 of Appendix C.

5.3.1 Finding the First Stressed Syllable in a Constituent

To find the HEAD of a constituent, the algorithm begins with the position
ITMAX of maximum F0 in the constituent. If contiguous points after ITMAX
maintain that same maximum F0 (forming a plateau), the center point of such
constant-F0 points is called the "Time of Peak" (TOP). (See Figure 12, where
the three segments following MAXFO maintain the same F0 value.) If, however,
F0 falls after ITMAX, and the utterance was unvoiced for two or more time
segments immediately before ITMAX, then a check is made on F0 values just
before the unvoicing began. If F0 before the unvoicing was within a
threshold percentage (presently THMAX = 20%) of F0 at ITMAX, and if F0 had
been non-falling or rising in the five time segments before unvoicing, then
TOP is set equal to the time of the last voiced segment just before unvoicing.
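
A minimal sketch of the TOP rule follows. It assumes F0 is given as one value per time segment with 0 marking unvoiced segments, that "within a threshold percentage" means within THMAX of the F0 value at ITMAX, and that TOP defaults to ITMAX when neither special case applies (the report does not state the default explicitly). The function and argument names are hypothetical.

```python
def time_of_peak(f0, itmax, thmax=0.20, lookback=5):
    """Return the index TOP for a constituent whose F0 maximum is at itmax.

    f0 holds one F0 value per time segment, with 0 for unvoiced segments
    (an assumed representation).
    """
    peak = f0[itmax]

    # Case 1: a plateau at the maximum; TOP is its center point.
    end = itmax
    while end + 1 < len(f0) and f0[end + 1] == peak:
        end += 1
    if end > itmax:
        return (itmax + end) // 2

    # Case 2: F0 falls after ITMAX and two or more unvoiced segments
    # immediately precede ITMAX.
    start = itmax
    while start > 0 and f0[start - 1] == 0:
        start -= 1
    if itmax - start >= 2 and start > 0:
        last_voiced = start - 1
        before = f0[max(0, last_voiced - lookback + 1):last_voiced + 1]
        voiced = all(v > 0 for v in before)
        non_falling = all(b >= a for a, b in zip(before, before[1:]))
        close = abs(f0[last_voiced] - peak) <= thmax * peak
        if voiced and non_falling and close:
            return last_voiced

    # Assumed default: TOP stays at ITMAX.
    return itmax
```

For a four-point plateau the integer center is taken as the second point; whether the report rounds the same way is not stated.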

The Time of Peak (TOP) gives a reasonable starting point from which to
search for the first stressed syllable. The test for previous unvoicing is
to allow for the fact that F0 may be higher immediately after voicing onset
after an unvoiced consonant than it is in the previous syllable, even
though the previous syllable may be the more stressed of the two syllables.
This refinement is also needed to provide a more reasonable starting point
for the archetype falling contour to be assigned following the HEAD of the
constituent. Proper setting of TOP can significantly affect the slope of
the archetype falling contour.

The next step is to search for the likely location of the stressed
syllable near TOP which will form the HEAD of the constituent. Within some
length of time BACKT (presently four hundred milliseconds) before TOP and
some threshold time FORWT (now three hundred milliseconds) after TOP, a
search is made for all dips in energy of a threshold amount EDIP (now set
at 5 db variations for the broadband energy function defined by Lea, Medress,
and Skinner, 1972a, p. 23). (The most efficient means for finding these and
other energy dips used in stressed syllable location is to precede the
stressed syllable location program by a little program that finds and marks
all peaks and dips in the energy contour, just as the boundary detector
provides for F0.)
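
The dip search can be sketched as below. The report does not define precisely when a local minimum counts as an EDIP-sized dip, so this sketch assumes a dip qualifies when the energy rises by at least EDIP on both sides before falling below the dip level again. The 10 ms time segment and all names are also assumptions.

```python
def head_search_window(top, n_segments, backt_ms=400, forwt_ms=300, seg_ms=10):
    """Segment range [start, stop) bracketing TOP by BACKT and FORWT.

    seg_ms (duration of one time segment) is an assumed value.
    """
    start = max(0, top - backt_ms // seg_ms)
    stop = min(n_segments, top + forwt_ms // seg_ms + 1)
    return start, stop


def find_energy_dips(energy_db, start, stop, edip=5.0):
    """Indices in [start, stop) of local energy minima at least edip db deep."""
    n = len(energy_db)
    dips = []
    for i in range(max(start, 1), min(stop, n - 1)):
        e = energy_db[i]
        # Local minimum; the first point of a flat bottom is used.
        if not (energy_db[i - 1] > e <= energy_db[i + 1]):
            continue
        left = e
        j = i - 1
        while j >= 0 and energy_db[j] >= e:
            left = max(left, energy_db[j])
            j -= 1
        right = e
        j = i + 1
        while j < n and energy_db[j] >= e:
            right = max(right, energy_db[j])
            j += 1
        if min(left, right) - e >= edip:
            dips.append(i)
    return dips
```

Running the dip finder once over the whole utterance and then filtering by window would match the suggestion that a preliminary program mark all energy peaks and dips.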

If only two dips occur within the time interval defined by BACKT and
FORWT, then the high energy portion bracketed by those dips will form the
time temporarily assumed to be associated with the HEAD of the constituent.
(With the present values of BACKT and FORWT this is highly unlikely, and the
700 ms will need to be divided into two or more syllables by other procedures
described below.) If more than two dips occur within the bracketed time of
BACKT and FORWT (as is the case in the example of Figure 12), tests must be
made for which high energy portion between dips is to be called the stressed
HEAD.

First the energy integral ENERGY is defined for each portion between
dips in the bracketed time region (portions before the first dip but after
the beginning of BACKT, and after the last dip but before the end of FORWT,
are neglected in this comparison of energy integrals). (This energy integral
specification might be most efficiently determined by the preliminary program
that finds energy peaks and dips.) The energy integrals are presently found
by simply summing the energy values in all time segments between the dips.
Where appropriate, the relative sizes of these energy integrals may be used
to select the portions which are the stressed syllables.
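
As a minimal sketch, the energy integral of each portion can be computed as a plain sum over the segments lying strictly between successive dips. The units of the energy values are left as given by the energy function, and the function name and tuple layout are hypothetical.

```python
def energy_integrals(energy, dips):
    """Return (first_segment, last_segment, integral) for each high energy
    portion lying strictly between two successive dips.

    Portions before the first dip and after the last dip are ignored,
    as in the comparison described above.
    """
    portions = []
    for left, right in zip(dips, dips[1:]):
        segment = energy[left + 1:right]
        portions.append((left + 1, right - 1, sum(segment)))
    return portions
```

With fewer than two dips the function returns an empty list, since no portion is then bracketed on both sides.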

The present algorithm assumes a preeminence of F0 as a stress cue, so,
before considering energy integrals, an F0 test is made. Of the several
high energy portions within the bracketed time, find all those which exhibit
an overall rise in F0 during the time that the energy does not dip below its
maximum by more than 3 db. That is, F0 at the first point where energy is
within 3 db of maximum (such as time segment labelled 150 in Figure 12) must
be less than F0 at the last point (such as time segment 280 in Figure 12)
before energy drops 3 db below the maximum. If more than two such portions
have rising F0, choose the first one unless the first one is only five or
fewer time segments in length or unless the energy integral of (any of) the
second or later one(s) is (are) markedly (presently 40% or more) greater
than that for the first one. If no portions show rising F0, then choose the
highest in energy integral.
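
The F0 test and the HEAD choice might be sketched as follows. Several points are assumptions rather than statements of the report: the threshold is read as 40% (the printed figure is partly illegible), the rule is applied whenever at least one portion rises, and when the first rising portion is rejected the later rising portion with the largest energy integral is taken. The dictionary keys and function names are hypothetical.

```python
def f0_rises(energy_db, f0, start, end, window_db=3.0):
    """True if F0 rises overall across the stretch around the energy
    maximum in which energy stays within window_db of that maximum."""
    peak_i = max(range(start, end + 1), key=lambda i: energy_db[i])
    floor = energy_db[peak_i] - window_db
    lo = hi = peak_i
    while lo > start and energy_db[lo - 1] >= floor:
        lo -= 1
    while hi < end and energy_db[hi + 1] >= floor:
        hi += 1
    return f0[hi] > f0[lo]


def choose_head(portions, min_len=6, ratio=1.40):
    """Pick the HEAD among portions, each a dict with keys
    'f0_start', 'f0_end', 'length' (in time segments) and 'integral'."""
    rising = [p for p in portions if p["f0_end"] > p["f0_start"]]
    if not rising:
        # No rising F0 anywhere: fall back on the largest energy integral.
        return max(portions, key=lambda p: p["integral"])
    first, later = rising[0], rising[1:]
    if later:
        best_later = max(later, key=lambda p: p["integral"])
        too_short = first["length"] < min_len  # five or fewer segments
        outweighed = best_later["integral"] >= ratio * first["integral"]
        if too_short or outweighed:
            return best_later
    return first
```

The sketch ignores unvoiced segments inside a portion; a fuller version would skip segments with no F0 value when picking the endpoints.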

If the high energy HEAD so selected is very long (with 300 ms or more
between its preceding energy dip and its following dip), then a test will be
made for more than one stressed syllable within it. Sometimes two or more
syllables without intervocalic obstruents will show no significant (5 db)
energy dips, and would appear to form a single "stressed syllable". If
there is a small dip of at least 2 db lasting for two or more time segments,
breaking the apparent HEAD into two high energy portions each of at least
100 ms in duration, and if F0 in the later portion is above the archetype
contour to be defined below, this second portion will be labelled as another
stressed syllable, distinct from the HEAD.

5.3.2 Finding Other Stressed Syllables In a Constituent 

Having found a stressed syllable corresponding to the HEAD for each
constituent, the stressed syllable location algorithm next searches for other
stressed syllables within each constituent. First, the TTAIL (time of the
TAIL) of the F0 contour is defined as the center of the last plateau or
bottom of the last small (2% or greater) valley of F0 within the constituent
(such as time segment 850 in Figure 12), not including the plateau or valley
bottom that the boundary position is set within. Next, a linear archetype
plot on the eighth-tone (logarithmic) F0 scale is drawn from the eighth-tone
value at the TOP to the eighth-tone value at the TAIL.
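
The archetype line can be sketched numerically. The conversion assumes that an eighth-tone is one eighth of a whole tone, so that there are 48 eighth-tones per octave; the reference frequency of 50 Hz and the function names are arbitrary choices of this sketch, not values taken from the report.

```python
import math


def eighth_tones(f0_hz, ref_hz=50.0):
    """F0 on a logarithmic eighth-tone scale (assumed 48 steps per octave)."""
    return 48.0 * math.log2(f0_hz / ref_hz)


def archetype_line(top, f0_top_hz, ttail, f0_tail_hz, ref_hz=50.0):
    """Return a function giving the archetype eighth-tone value at time t,
    a straight line from (top, F0 at TOP) to (ttail, F0 at TAIL)."""
    y_top = eighth_tones(f0_top_hz, ref_hz)
    y_tail = eighth_tones(f0_tail_hz, ref_hz)
    slope = (y_tail - y_top) / (ttail - top)
    return lambda t: y_top + slope * (t - top)
```

Because the line is compared only with F0 values on the same scale, the choice of reference frequency shifts both by the same amount and does not affect which segments lie above the archetype.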

Then a search is made for all instances, after the energy dip marking
the end of the HEAD and before the TTAIL, where the eighth-tone value of F0
for five or more consecutive segments is greater than that defined by the
archetype line. (When the HEAD is longer than 300 ms so that two or more
stressed syllables might be included in the HEAD, the test for increases in
F0 above the archetype begins at 100 msec before the end of the HEAD, or at
the small 2 db energy dip defining a possible syllable boundary, whichever
is first.) If F0 in eighth tones is above the archetype line for the
minimum duration (presently set at five time segments) and if F0 is rising
during that time, or flat, then the high energy portion (more than 60 ms
in duration), bounded by 5 db dips, which is associated with this F0 rise is
called another stressed syllable in the constituent. To determine which
high energy portion is associated with this non-falling F0 stretch above
the archetype line (that is, to establish the bounds of this other stressed
syllable), a search for nearby high energy portions is made. If no energy
dips occur in the time that F0 is non-falling, then the stressed syllable
extends to the immediately preceding and following 5 db dips (such as at
time segments 410 and 660 in Figure 12). If dips do occur during the
non-falling portion of the contour above the archetype line, the same tactics
for selection apply as with HEADs; namely, the first stretch with rising F0
and high energy is chosen unless it is too short (less than 60 ms) or lower
in energy integral by [illegible]% or more than a following high energy
portion whose F0 is still above the archetype.
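
The search for stretches above the archetype line can be sketched as a simple run detector. It assumes both contours are already expressed in eighth-tones, one value per time segment, and it treats "rising or flat" as the last value of the run being no lower than the first. These simplifications and the function names are assumptions of the sketch.

```python
def runs_above_archetype(f0_et, archetype_et, start, stop, min_run=5):
    """Return (first, last) index pairs of runs within [start, stop) where
    F0 stays above the archetype for at least min_run consecutive segments."""
    runs = []
    run_start = None
    for t in range(start, stop + 1):
        # t == stop acts as a sentinel that closes any open run.
        above = t < stop and f0_et[t] > archetype_et[t]
        if above and run_start is None:
            run_start = t
        elif not above and run_start is not None:
            if t - run_start >= min_run:
                runs.append((run_start, t - 1))
            run_start = None
    return runs


def is_non_falling(f0_et, first, last):
    """Overall rising or flat F0 across the run (assumed interpretation)."""
    return f0_et[last] >= f0_et[first]
```

Each qualifying run would then be matched to the high energy portion bounded by the nearest 5 db dips, as described in the text.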

One case is also allowed where a falling F0 which is still above the
archetype line can be declared a stressed syllable. If, for six or more
time segments, F0 is above the archetype line but falling, a search for 5 db
energy dips in that area is undertaken. Between two dips, determine the
total portion that is within 3 db of the maximum intensity. If F0 does not
fall more than an average of two eighth-tones per five time segments within
this high energy portion, then that portion is also declared a stressed
syllable. This allows stressed syllables where F0 had been falling rapidly,
but was locally increased above the archetype to be a much more gradual fall.
Thus, the increase in F0, above what might have been, really marks the presence
of a stressed syllable, even if the F0 is not rising absolutely.
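
The gradual-fall exception reduces to a slope test, sketched below. It assumes the average fall is measured from the first to the last segment of the high energy portion, on the eighth-tone scale; the function name and arguments are hypothetical.

```python
def gradual_fall_is_stressed(f0_et, first, last, max_fall_per_5=2.0):
    """True if F0 (in eighth-tones) falls by no more than max_fall_per_5
    eighth-tones per five time segments, on average, between first and last."""
    n_steps = last - first
    if n_steps <= 0:
        return False
    fall = f0_et[first] - f0_et[last]
    return fall * 5.0 / n_steps <= max_fall_per_5
```

For example, a fall of 3 eighth-tones over 10 segments averages 1.5 per five segments and passes, while a fall of 6 over the same span averages 3 and fails.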

5.4 Comparison of Algorithm Locations With Perceived Stress Patterns

The stressed syllable location algorithm has not been implemented as a
computer program. However, it has been followed strictly in a hand analysis
of acoustic cues to stress patterns for the speech texts listed in section 2.
The results of such algorithmic locations of stressed syllables were compared
with the perceptions of stress. The complete sets of algorithmic results are
shown in Figures C-2 to C-11 in Appendix C. Those figures show the texts as
spoken by the various talkers, with a box around each syllable that was
perceived as stressed by two or more listeners (that is, that had a stress
score of +2 or +3; see section 4.2). Also shown in the figures are lines
underscoring all those portions of the texts that were found within the high
energy portions declared as "stressed syllables" by the algorithm.

Thus, for example, Figure C-3 shows that the syllables sun-, strikes,
rain-, air, act, pris-, form, and rain- were perceived as stressed in the
first sentence of the Rainbow Script as read by talker GWH. However, the
algorithm found When, sun-, strikes, rain-, in the air, act, pris-,
and rain- as included in the high energy portions declared to be "stressed
syllables". Thus, it gave a "false" detection of When as stressed, missed
the stressed syllable form, and included within some "stressed syllables"
portions which were unstressed. Extended voiced sequences, and especially
sonorant sequences, may have no significant energy dips, so that sequences
such as in the air, -orizon, boiling, when a man looks, the end, etc. may
be included in "stressed syllables". As long as a stressed syllable is
included within each such stretch, we may consider that no false alarm has
occurred, and that that stressed syllable has been correctly located.
However, if two stressed syllables were included within the single stretch,
we would consider one correct location (and one miss).

Stretches which the algorithm declares stressed but which did not include
any syllable with a stress score of +2 or +3 are considered "false" stress
detections (e.g., ite in Figure C-3). One major source of such false
alarms is a false boundary detection (e.g., as in the middle of the word
contain in ARPA Sentence 3 shown in Figure C-10). When false boundaries are
assigned, they demand that a stressed HEAD be found in each of the surrounding
constituents (so that, e.g., con- must be a stressed HEAD since it is a
constituent). Some located portions also occur where listeners WAL and MFM
perceived a syllable as stressed, but since listener TES perceived it as
reduced, it was assigned a stress score of +1. With a more consistent set
of listeners, these may be perceived as stressed and the location would be
correct.

The stress scores marked on the false locations and missing locations
in Figures C-2 to C-11 show that many false alarms were on syllables with
stress score +1 (perceived as stressed by at least one listener), while
most misses were on syllables of score +2, where not all listeners agreed
the syllable was stressed.

Table IV summarizes the stressed syllable location results from the
hand analysis with the algorithm. Shown for each text and talker are the
numbers of syllables perceived as stressed, the number of those found by
the algorithm, and the consequent percentages of all stressed syllables
that were correctly detected. Also shown are the numbers of false locations
in each run, and the percentages of all locations by the algorithm
that were false (that is, did not include syllables perceived as stressed).
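
The two percentages in Table IV can be illustrated with a short sketch that follows the counting rules of this section: a located stretch counts as one correct location if it contains at least one syllable perceived as stressed, and as a false location otherwise. Representing syllables by integer positions is an assumption of the sketch, and the names are hypothetical.

```python
def score_locations(stretches, stressed):
    """Score algorithm output against perceptions.

    stretches: list of sets of syllable positions underscored by the algorithm.
    stressed:  set of syllable positions with a stress score of +2 or +3.
    Returns (correct, false, percent_correct, percent_false).
    """
    correct = sum(1 for stretch in stretches if stretch & stressed)
    false = len(stretches) - correct
    percent_correct = 100.0 * correct / len(stressed)
    percent_false = 100.0 * false / len(stretches)
    return correct, false, percent_correct, percent_false
```

A stretch covering two stressed syllables still adds only one correct location, so the second syllable lowers the percent correct, in line with the "one correct location (and one miss)" convention.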

While scores varied some from text to text and talker to talker, the
overall scores of 78% to 98% (average, 85%) correct location of stressed
syllables are very encouraging. The Monosyllabic Script, with its prominent
stresses on monosyllabic words, yielded quite high scores. The spontaneous
ARPA sentences, which were more monotone and which gave some difficulties
to the boundary detection algorithm, showed the lowest stressed syllable
location scores.

The false alarm rates were fairly high, ranging from 7% up to 28%.
Some of the false alarms will be eliminated by improvements in the boundary
detector. Some other "false" locations are not necessarily bad, since one
or two listeners did perceive those syllables as stressed. A few of the
false alarms may be eliminated by not demanding stressed HEADS in short
constituents (such as those less than 200 ms in duration). Further studies
are needed to reduce false alarm rates and simultaneously maintain or improve
the scores for correct locations. Ultimately, the design of a better algorithm
for stressed syllable location must be based on a strategic decision as to
whether it is better to have some false alarms and correspondingly increase
the success in correct location, or to have few or no false alarms at
the cost of lower scores in correct location. This will substantially
depend upon the specific use of stressed syllable information in other
aspects of the speech understanding system. For guiding distinctive features
estimation procedures, all that might come from having a few false locations
is that distinctive features analysis may occasionally be applied (perhaps
wastefully or with some difficulty) in the somewhat-less-reliably-encoded
unstressed syllables.

It would be of interest to compare the substantial success in stressed
syllable location which was attained with the present algorithm with results
that might be attained with other algorithms, such as simpler ones that
merely look for all peaks, or for all high intensity portions or high
energy integral portions of the speech. These and other further studies in
stressed syllable location and constituent boundary detection will be
outlined in section 6.

6. CONCLUSIONS AND FURTHER WORK

In this report, methods have been described for segmenting speech into
grammatical phrases and identifying stressed syllables in continuous speech. 
The program for detecting syntactic boundaries from fall-rise patterns in 
voice fundamental frequency contours has been shown, both by the present 
study and by previous studies, to succeed in finding over 80% of all
syntactically predicted boundaries between major syntactic units. It also,
however, detects some syntactic boundaries not predicted by the intuitive 
constituent structure analysis previously applied, and detects false 
boundaries not apparently related to syntactic structure, such as at 
consonant-vowel boundaries. 

The algorithm for stressed syllable location has succeeded in locating
around 85% of all syllables perceived as stressed by the majority votes of
a panel of listeners. The procedure identifies stressed syllables with
high energy-integral portions of the speech which exhibit rising or
non-falling F0, but it does so in a way which makes use of constituent
boundaries and archetype F0 contours. Simpler procedures might conceivably
work as well, and there is obviously room for improvement in the present
location scores.

Besides such algorithmic results, the other major aspect of research
reported herein has been concerned with the perceptions of stress levels
by three listeners. Two listeners were found to agree in their perceived
stress levels for most of the individual syllables in the Rainbow Script,
the Monosyllabic Script, and the ARPA man-machine interaction sentences. They
differed on only about 5% of all syllables as to whether they were stressed
or not, and each of them showed only about 5% confusions in decisions about
stressed syllables from one trial to another. Unstressed and reduced levels
were much more frequently confused. A third listener differed from the
other two listeners on about half of his stress level judgments. About 20
to 25% of all syllables were labelled stressed by the other listeners, but
unstressed by this third listener. This listener also labelled substantial
percentages of all syllables as stressed on one trial and unstressed on
another. Such listeners who are inconsistent in their own judgments and
who differ dramatically from other listeners should be excluded in any
attempts to establish standards about which are the actual "stressed
syllables" in connected speech.

The listeners appear to be as consistent in their assignments of
stress levels given only the written text as they are in their assignments
when listening to the speech recordings. However, their judgments without
speech do not correspond well with their judgments with speech if the
speech is spontaneous (that is, not produced by speakers reading written
texts). Listeners apparently differ most dramatically from each other,
and yield more confusions in stress levels from repetition to repetition,
when yes-no questions are involved.

The majority stress perceptions from three trials by each listener, 
when pooled so as to yield the sum plots as shown in Figure 2^ provide a 
"standard" for determining all stressed syllables which is stable to within 
about 5%. This is suitable for evaluating an algorithm for locating 
stressed syllables to within a 5% tolerance in overall location scores. 

Several forms of further work are needed. The program for constituent
boundary detection can be refined to produce fewer false alarms by requiring
each new F0 maximum or minimum to remain beyond the 7% thresholds for at
least 20 ms. It would be desirable to remove or augment the strict dependence
on a fixed (7%) threshold for F0 changes, and to incorporate an overall
confidence measure for each boundary, based on the percentage decrease in F0
before the apparent boundary, the percentage increase after the boundary,
the shape of the contour near the boundary, and the time between that boundary
and the immediately preceding or following ones. Thus, cusp-like changes at
unvoiced consonants and very brief F0 dips or jumps may be assigned very low
likelihood of being boundaries, while major gradual changes would be assigned
higher confidence ratings. One or both of two boundaries separated by short
times (on the order of 200 ms or less) might be considered suspect, and
assigned a low confidence rating.

The boundary predictions should be improved by defining and applying a
strict set of rules for syntactic bracketing and prediction of intonation
contours. Intonation rules such as Bierwisch (1966) produced for German
are needed, along with the selection of an adequate grammar to define the
syntactic structure that would be part of the input to such intonation
rules. Working with Jane Robinson from the University of Michigan, we hope
to apply such rules to texts such as those used in the present studies.

The algorithm for locating stressed syllables must be implemented as
a computer program and tested carefully to see that it performs at the level
of success attained in the previous hand analyses. Also, several
improvements are needed. Among those to be investigated are better procedures for
defining the TAIL of a constituent, a careful "tuning" of all the parameters
and detailed steps for selecting HEADS and other stressed syllables, use of
a low-frequency "sonorant" energy function rather than the present broadband
energy function (so that better syllabication might be attained), and the
incorporation of procedures for locating other possible stressed syllables
before the HEAD (or peak position) when the peak occurs late in a
constituent (say more than 400 or 500 ms after the preceding boundary).

It also seems reasonable to compare the results with the present stressed
syllable algorithm (either before or after it is implemented as a computer
program) with results in stressed syllable location by other possible procedures.
For example, if one called all long-duration portions where energy was above a
threshold value stressed syllables, how many of the perceived stressed
syllables would be detected and how many false alarms would result? Alternatively,
could one get comparable success by looking for all F0 rises or upward inflections
and choosing the high energy portion nearest such places, without use of
boundaries and archetype contours in the procedure?
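
The first of these simpler alternatives, calling every sufficiently long stretch of above-threshold energy a stressed syllable, might be sketched as follows. This is a hypothetical baseline for comparison purposes, not a procedure that was run in this study. The function name `high_energy_stretches`, the frame-based energy list, and the particular threshold and minimum duration are assumptions chosen only for illustration.

```python
def high_energy_stretches(energy, threshold, min_frames):
    """Return (first, last) frame index pairs (inclusive) for every run of
    frames whose energy stays at or above `threshold` for at least
    `min_frames` consecutive frames.

    energy     -- hypothetical list of per-frame energy values
    threshold  -- assumed energy threshold, same units as `energy`
    min_frames -- assumed minimum run length, in frames
    """
    stretches = []
    start = None  # index where the current above-threshold run began
    for i, e in enumerate(energy):
        if e >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended at frame i - 1; keep it if long enough.
            if i - start >= min_frames:
                stretches.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the utterance.
    if start is not None and len(energy) - start >= min_frames:
        stretches.append((start, len(energy) - 1))
    return stretches


# Made-up energy contour: two long runs and one run that is too short.
example_energy = [1, 5, 6, 7, 2, 8, 1, 9, 9, 9, 9]
print(high_energy_stretches(example_energy, threshold=5, min_frames=3))
```

The stretches returned by such a baseline could then be scored against the listeners' stress judgments in the same way as the locations from the present algorithm, which would answer the question of how many perceived stressed syllables it detects and how many false alarms it produces.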

More extensive experiments are needed wherein the various variables of
sentence type, talker, lexical forms, phonetic content, position in sentence
and intonation contour, and such could be independently controlled. Texts
for such studies are now being designed (cf. Lea, Medress, and Skinner, 1972a,
pp. 56-57), and such studies will be conducted. In particular, such studies
can test further the apparent difficulty in listeners' assignments of stress
within yes-no questions, and the relative successes in boundary detection and
stressed syllable location within questions versus declaratives or commands.

The application of boundary detections and stressed syllable locations
to guiding a partial distinctive features analysis must yet be done. Until
some details of the distinctive features analysis are better defined, the
question cannot be resolved as to whether higher "hit" rates or lower "false
alarm" rates are more important to attain in the boundary detection or
stressed syllable location algorithm. Also, techniques must be explored for
applying boundary and stressed syllable information to the aid of syntactic
parsers. Such efforts will be critical to implementing the proposed speech
recognition strategy at Univac.

APPENDIX A: CONSTITUENT BOUNDARY DETECTION RESULTS

The constituent boundary detection program marks boundaries between
major syntactic units by locating the last time of minimum value in
an F0 "valley" which is preceded by a 7% decrease in F0 and followed by a
7% increase. A general flow chart of the procedure was published in Lea's
thesis (Lea, 1972b, p. 206). A detailed flowchart (available upon request)
has been obtained by an automatic flow-charting routine at Sperry Univac,
for that version implemented at Univac and used for boundary detection on
the Monosyllabic Script and the ARPA Sentences.
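
As a rough illustration of the fall-rise valley criterion, the following Python sketch marks a boundary at the last frame of minimum F0 in any valley that is preceded by a sufficient percentage fall and followed by a sufficient percentage rise. It is not the Univac program and makes several simplifying assumptions: the F0 contour is given as a list of positive values for consecutive voiced frames, unvoiced gaps and pauses are ignored, and the name `detect_valleys` and its parameters are hypothetical. Separate fall and rise thresholds are allowed.

```python
def detect_valleys(f0, fall_pct=7.0, rise_pct=7.0):
    """Return frame indices of fall-rise valleys in an F0 contour.

    f0       -- hypothetical list of positive F0 values, one per voiced frame
    fall_pct -- required percentage decrease from the preceding maximum
    rise_pct -- required percentage increase from the valley minimum
    """
    boundaries = []
    if not f0:
        return boundaries
    peak = f0[0]            # highest F0 seen since the last boundary
    valley = f0[0]          # lowest F0 seen since the fall was confirmed
    valley_idx = 0
    fall_confirmed = False
    for i, value in enumerate(f0):
        if not fall_confirmed:
            if value > peak:
                peak = value
            elif value <= peak * (1.0 - fall_pct / 100.0):
                fall_confirmed = True
                valley, valley_idx = value, i
        else:
            if value <= valley:
                # "<=" keeps the LAST time of minimum value in the valley.
                valley, valley_idx = value, i
            elif value >= valley * (1.0 + rise_pct / 100.0):
                boundaries.append(valley_idx)
                peak = value
                fall_confirmed = False
    return boundaries


# Made-up contour (Hz): a fall from 122 to 108, then a rise to 125.
example_f0 = [120, 122, 118, 110, 108, 108, 112, 118, 125, 124]
print(detect_valleys(example_f0))                 # equal 7% thresholds
print(detect_valleys(example_f0, 7.0, 3.0))       # fall 7%, rise 3%
```

Refinements suggested elsewhere in this report, such as requiring a new extremum to persist for a minimum time or attaching a confidence rating to each boundary, are deliberately left out of this sketch.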

Figures A-1, A-2, and A-3 show the detected boundaries for the
Rainbow Script (as spoken by six talkers), the Monosyllabic Script
(as spoken by two talkers), and the ARPA Sentences, respectively. Vertical
bars mark predicted constituent boundaries that were detected; predicted
boundaries that were not detected (that is, were "missing") are indicated
by asterisks at the positions of the syntactic breaks. Boundaries between
minor syntactic constituents (that were detected but not predicted) are
shown by columns of dots, and "false" (syntactically unrelated) boundaries
that were detected are shown by question marks. Sentence boundaries were
expected to be accompanied by both the vertical bars marking F0-valley
constituent boundaries and by pauses of 35 centiseconds or more, to be
marked by S's on the vertical bars. When a sentence boundary was not
accompanied by a sufficient pause, but was detected as a constituent
boundary, it was marked by a distinct symbol. In the ARPA Sentences,
occasional extra hesitation pauses occur that are not associated with major
syntactic boundaries. These are marked in Figure A-3 by S's with columns
of dots (not vertical bars).

 1. (LS21)    Who's the owner of utterance eight?
 2. (LM13)    Display the phonemic labels above the spectrogram.
 3. (B27)     Do any samples contain troilite?
 4. (B10)     What is the average uranium lead ratio for the lunar samples?
 5. (RB6)     Do you have any right square boxes left?
 6. (RB16)    Put the other red block on the red block.
 7. (LM3)     Who is the owner of utterance eight?
 8. (B35)     Do any samples contain tridymite?
 9. (RA19)    Would you move the stack of right circular cylinders to the
              right by half a square?
10. (RC8)     Place the red triangle two squares back from the front of the
              floor in the middle.
11. (CB1300)  Alpha becomes alpha minus beta.
12. (CB2300)  Alpha gets alpha minus beta.
13. (D10)     Repeat where key word equals Gauss elimination or key word
              equals eigenvalues.

Figure A-3. Complete Boundary Detection Results for the 13 ARPA Sentences.
Symbols marking boundaries are explained in Figure A-1.

Table A-1 shows the boundary detection results for the 13 ARPA Sentences,
separated into categories for each type of sentence. WH-questions and
commands show the most missing, or undetected, boundaries, thus yielding the
lowest constituent boundary detection scores. Three of the missing
boundaries in the commands, and two of those missing in WH-questions, are
in compound noun constructions, which are certainly among the most minor of
the predicted boundaries. Another missing boundary in a command is an
NP-Verbal boundary, which has been shown to be a type of boundary which is
frequently missing. Boundaries are also missing after the WH-pronoun plus
copulatives ("Who's" or "Who is"). Boundaries might be argued to be less
likely there anyway, since previous results have shown that pronouns and
copulatives both are less likely to be followed by detectable boundaries.

The only boundaries in WH-questions and commands that are notable in their
absence, then, are that after the command verb Display in LM13 and that
before the preposition phrase of 1213. These misses are apparently due to
the monotonic speech of that particular talker.

We are left with little or no evidence that sentence type affects
relative boundary detection scores, except for the WH-pronoun effects.

The extra pauses in the command and polite command are not necessarily 
results of sentence type, but are all hesitation pauses in the spontaneous 
protocols from Stanford Research Institute. 

Since boundary detection scores were somewhat lower in the ARPA Sentences,
and since the monotonic patterns in those spontaneous utterances seemed
to be one factor in the results, a study was conducted on the thresholds
for detecting fall-rise valleys in F0, and how they correlate with boundary
scores, for the 6ARPA sentences. Figure A-4 shows the number of
predicted, extra, and false boundaries detected in the 6ARPA sentences as a
function of threshold. As the threshold is increased (that is, a boundary
must be preceded by larger F0 decreases and followed by larger F0 increases),
the number of false boundaries rapidly drops while the numbers of predicted
and extra syntactic boundaries decrease much more gradually. Any threshold
above 3% and below 10% or so thus eliminates most false boundaries while
preserving the detection of most predicted boundaries.

The threshold plotted along the abscissa in Figure A-4 is the smaller
of the two thresholds for percentage rise in F0 and percentage fall in F0.
Past work (Lea, 1972b) has been conducted with both thresholds equal.
The Univac implementation permits unequal fall and rise thresholds.

Figure A-4. Effect of Threshold Size on Boundary Detection Results, for the
6 ARPA Sentences.

Figure A-4 shows some results with unequal thresholds. Plotted at the
smallest threshold value of 3% are:

(a) the results with a minimum fall of 7% required while only a 3%
following rise is required, symbolized by the pair (7,3); and

(b) the results with a minimum fall of only 3% required while a
7% following rise is required, symbolized by the pair (3,7).

Thus, when the fall threshold (the first in the pair) is greater than the
rise threshold, more predicted and extra syntactic boundaries are correctly
detected, and fewer false boundaries are detected, than if the rise threshold
were greater than the fall threshold. Of course, fewer boundaries of all
types are detected if both thresholds are increased, but for unequal
thresholds, the fall threshold should be greater than the rise threshold.
This is to be expected when one considers the general falling contours of
F0 or intonation in English (see Figures 9, 10, and 12 of this report).

APPENDIX B: DETAILS OF PERCEIVED STRESS PATTERNS

Figure B-1 illustrates a sheet on which the perceived stress levels of
one listener are recorded for one recorded text, the 6ARPA Sentences.
Similar sheets were obtained for each trial with each listener, each text,
and each talker. Stressed, unstressed, and reduced syllables were marked
as S, U, and R, respectively, by this listener (MFM) and another listener
(WAL). Listener TES labelled them as levels 1, 2, and 3, respectively.
Vertical lines delimited syllables, to facilitate marking for every
syllable.

Figures B-2 to B-15 summarize the majority perceptions from three
repetitions for three listeners. The majority perceptions for each listener
were first obtained (for each text and talker) from three repetitions.
Then the number of majority votes of a syllable as stressed, minus the
number of votes as reduced, was plotted under each syllable ("unstressed"
judgments were thus assigned zeros, neither adding to nor subtracting
from the syllable's stress score). Figures B-2 to B-8 show the results
for the Rainbow Script spoken by six talkers and for the NO SPEECH
condition where only the written text was provided to the subjects.
Figures B-9 to B-11 show results for the Monosyllabic Script with two
talkers and NO SPEECH conditions. Figures B-12 and B-13 are corresponding
SPEECH and NO SPEECH results for 6ARPA, while B-14 and B-15 are for 7ARPA.
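
The stress score computation just described can be illustrated with a short Python sketch. The labels S, U, and R follow the marking sheets; everything else is an assumption made for illustration, including the function names `majority_label` and `stress_score` and, in particular, the choice to treat a three-way split (one S, one U, one R) as "unstressed", a case the text does not address.

```python
from collections import Counter


def majority_label(trials):
    """Return the label a listener gave in more than half of the trials.

    trials -- labels from repeated trials, each "S", "U", or "R".
    If no label has a strict majority, "U" is returned; this tie rule is
    an assumption, not something specified in the report.
    """
    label, count = Counter(trials).most_common(1)[0]
    return label if count > len(trials) / 2 else "U"


def stress_score(judgments):
    """Stress score for one syllable.

    judgments -- one list of trial labels per listener.
    The score is the number of listeners whose majority label is "S"
    minus the number whose majority label is "R"; "U" contributes zero.
    """
    majorities = [majority_label(trials) for trials in judgments]
    return majorities.count("S") - majorities.count("R")


# Made-up example: two listeners hear the syllable as stressed,
# one hears it as reduced, giving a stress score of +1.
example = [["S", "S", "U"], ["S", "U", "S"], ["R", "R", "U"]]
print(stress_score(example))
```

With three listeners the score ranges from -3 to +3, and a syllable counts as perceived stressed for scoring purposes when at least two listeners agree, that is, at a score of +2 or +3.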
STRESS PERCEPTIONS ON ARPA SENTENCES

Figure B-1. Sample of the Sheets Used for Marking Stress Judgments
(Listener MFM).

Figure B-12. Summary of Stress Judgments by Three Listeners,
for the 6ARPA Sentences as Spoken.

Figure B-13. Summary of Stress Judgments by Three Listeners,
When Given Only the Written Text of the 6ARPA Sentences (NO SPEECH).

Figure B-14. Summary of Stress Judgments by Three Listeners,
for the 7ARPA Sentences as Spoken.

Figure B-15. Summary of Stress Judgments by Three Listeners,
When Given Only the Written Text of the 7ARPA Sentences (NO SPEECH).

A flowchart of the stressed syllable location algorithm is shown in
Figure C-1. This is a characterization of the hand analysis procedure,
and may have to be modified and specified in more detail for implementation
as a computer program.

The results of applying the algorithm to stressed syllable location
for each of the recorded speech texts are shown in Figures C-2 to C-11.
The figures show the majority stress scores above each syllable. Those
syllables perceived as stressed by two or more listeners (i.e., SS =
+2 or +3) are shown in boxes. The syllables or speech portions which
were declared to be stressed by the algorithm are shown underlined.
Whenever an underlined portion includes a boxed-in stressed syllable, a
correct location has been obtained. Cases where an underlined portion
did not include a boxed-in syllable (that is, no part was perceived as
stressed by two or more listeners) are false locations of stressed
syllables. Many of these false locations resulted from false constituent
boundary detections, since the present procedure demands that every
detected constituent have a stressed HEAD.